How to read line-delimited JSON from large file

Posted 2019-04-03 08:40

Question:

I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:

{
    "key11": value11,
    "key12": value12,
}
{
    "key21": value21,
    "key22": value22,
}
…

The way I'm importing it now is:

content = open(file_path, "r").read() 
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")

Which seems like a hack (adding commas between each JSON string and also a beginning and ending square bracket to make it a proper list).

Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?

Also, Python can't seem to allocate memory for an object built from 2GB of data. Is there a way to construct each JSON object as I read the file line by line? Thanks!

Answer 1:

Just read the file one line at a time and build a JSON object from each line as you go:

import json

with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)  # one decoded object per line

This way, each line is parsed as a complete JSON object, provided every object sits on a single line (no \n in the middle of an object), and you avoid the memory problem because only one object is built at a time.
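
The loop above can be wrapped in a generator so that callers consume one object at a time. This is a minimal sketch, assuming one complete JSON object per line and UTF-8 text; the name iter_json_lines is hypothetical, not part of the json module.

```python
import json


def iter_json_lines(path):
    """Yield one decoded object per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line failed instead of a bare decode error
                raise ValueError(
                    f"line {line_number} is not valid JSON: {exc}"
                ) from exc
```

A caller would write something like `for record in iter_json_lines(file_path): ...`, so only the current record is held in memory.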

See also this answer:

https://stackoverflow.com/a/7795029/671543



Answer 2:

This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.

{
    "key11": 11,
    "key12": 12
}
{
    "key21": 21,
    "key22": 22
}

Just read line-by-line, and build the JSON blocks as you go:

import json

# file_path is the path to the input file, as in the question
with open(file_path, 'r') as infile:

    # Lines of the JSON block currently being built
    json_block = []

    for line in infile:

        # Add the line to our JSON block
        json_block.append(line)

        # A line starting with '}' closes the current top-level object
        if line.startswith('}'):

            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)

            # Start a new block
            json_block = []
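
If the layout is less regular (for example, a closing brace that does not start its own line), a variation is to let the decoder find where each object ends. The sketch below uses json.JSONDecoder.raw_decode, which returns the decoded value together with the index where it stopped. It assumes every top-level value is an object or array and that the file is UTF-8; the function name iter_concatenated_json and the chunk_size default are hypothetical choices.

```python
import json


def iter_concatenated_json(path, chunk_size=65536):
    """Yield each top-level JSON value from a file of back-to-back JSON texts."""
    decoder = json.JSONDecoder()
    buffer = ""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buffer += chunk
            while True:
                # raw_decode does not skip leading whitespace itself
                buffer = buffer.lstrip()
                if not buffer:
                    break
                try:
                    obj, end = decoder.raw_decode(buffer)
                except json.JSONDecodeError:
                    if at_eof:
                        raise  # leftover text is genuinely malformed
                    break  # probably an incomplete value: read more
                yield obj
                buffer = buffer[end:]
            if at_eof:
                break
```

Only the unparsed tail of the file is buffered, so memory use is bounded by the chunk size plus the largest single object. Malformed input in the middle of the file is only reported once the end of the file is reached.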

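If you need to parse one very large JSON document and do not want to keep every decoded object, look at the object_hook and object_pairs_hook parameters of json.load: they are called for each object as it is decoded, so you can process or discard it there. Note that json.load still reads the entire file into memory before parsing.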
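
As a small illustration of how such a hook is called, the sketch below passes an object_pairs_hook that rejects duplicate keys. The hook name reject_duplicate_keys is hypothetical; the object_pairs_hook parameter itself is part of the standard json module and is accepted by both json.load and json.loads.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, failing on a repeated key.

    The decoder calls this once for every JSON object, nested objects
    before the objects that contain them.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


record = json.loads('{"key11": 11, "nested": {"key12": 12}}',
                    object_pairs_hook=reject_duplicate_keys)
```

Whatever the hook returns replaces the decoded object, so a hook can also validate, convert, or drop data while parsing.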



Answer 3:

import json
with open(file_path, "r") as f:  # expects one JSON object per line
    data = [json.loads(line) for line in f if line.strip()]