I have some JSON files that are 500 MB each. If I use the "trivial" json.load to load the whole content at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
Any suggestions? Thanks
There was a duplicate of this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do something like this (a minimal sketch; the filename is a placeholder):
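```python
import ijson

# Iterate over SAX-like parse events instead of loading the whole file into memory.
with open('myfile.json', 'rb') as f:  # placeholder filename
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)
```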
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.

The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.

Another idea is to load it into a document-store database like MongoDB. It deals with large blobs of JSON well, although you might run into the same problem loading the JSON in the first place; avoid that by loading the files one at a time.
If that path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
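As a rough sketch (assuming pymongo, that the top level of each file is a JSON array of objects, and with placeholder database, collection, and file names), you could combine ijson with MongoDB's Python client to insert records one at a time, so a whole file never has to sit in memory:

```python
import ijson
from pymongo import MongoClient

# Placeholder connection string, database, and collection names.
client = MongoClient('mongodb://localhost:27017/')
collection = client['mydb']['records']

# Stream each element of the top-level JSON array and insert it individually,
# so only one record needs to be in memory at a time.
with open('myfile.json', 'rb') as f:  # placeholder filename
    for record in ijson.items(f, 'item'):
        collection.insert_one(record)
```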