I need to analyze a large dataset that is distributed as an lz4-compressed JSON file.
The compressed file is almost 1 TB. I'd prefer not to decompress it to disk because of the storage cost. Each "record" in the dataset is very small, but reading the entire dataset into memory is obviously not feasible.
Any advice on how to iterate through the records in this large lz4-compressed JSON file in Python 2.7?
As of version 0.19.1 of the python lz4 bindings, full buffered IO support is provided, so you should be able to do something like:
import lz4.frame

chunk_size = 128 * 1024 * 1024  # roughly 128 MB of decompressed data per read
with lz4.frame.open('mybigfile.lz4', 'r') as f:  # 'f' avoids shadowing the builtin 'file'
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # read() returns an empty result at end of file
            break
        # Do stuff with this chunk of data.
which will read decompressed data from the file roughly 128 MB at a time. Note that chunk boundaries will not line up with record boundaries, so you will need to carry any partial record at the end of one chunk over into the next.
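If the dataset is newline-delimited JSON (one record per line; the question doesn't specify, so this is an assumption), you can avoid manual chunking altogether: the object returned by lz4.frame.open is a buffered reader, so it can be iterated line by line. A minimal sketch:

import json
import lz4.frame

with lz4.frame.open('mybigfile.lz4', 'r') as f:
    # Each iteration yields one line, i.e. one record; only a single
    # small record is held in memory at any point.
    for line in f:
        record = json.loads(line)
        # Do stuff with this record.

If instead the file is one huge top-level JSON array, an incremental parser such as ijson can consume the decompressed stream directly, since it only needs a file-like object with a read() method. A sketch under that assumption:

import ijson
import lz4.frame

with lz4.frame.open('mybigfile.lz4', 'r') as f:
    # The 'item' prefix addresses each element of the top-level array,
    # so records are yielded one at a time without building the whole
    # document in memory.
    for record in ijson.items(f, 'item'):
        # Do stuff with this record.
        pass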
Aside: I am the maintainer of the python lz4 package. Please do file issues on the project page if you have problems with the package, or if anything in the documentation is unclear.