I have a bunch of json objects that I need to compress as it's eating too much disk space, approximately 20 gigs
worth for a few million of them.
Ideally what I'd like to do is compress each individually and then when I need to read them, just iteratively load and decompress each one. I tried doing this by creating a text file with each line being a compressed json object via zlib, but this is failing with a
decompress error due to a truncated stream
,
which I believe is due to the compressed strings containing new lines.
Anyone know of a good method to do this?
You might want to try an incremental json parser, such as jsaone.
That is, create a single json with all your objects, and parse it like
This is quite similar to Martin's answer, wasting slightly more space but maybe slightly more comfortable.
EDIT: oh, by the way, it's probably fair to clarify that wrote jsaone.
Just use a
gzip.GzipFile()
object and treat it like a regular file; write JSON objects line by line, and read them line by line.The object takes care of compression transparently, and will buffer reads, decompressing chucks as needed.
This has the added advantage that the compression algorithm can make use of repetition across objects for compression ratios.