Is there a memory efficient and fast way to load b

I have some JSON files of about 500 MB each. If I use the "trivial" json.load to load the content all at once, it consumes a lot of memory.

Is there a way to read the file partially? If it were a line-delimited text file, I could iterate over the lines. I am looking for an analogy to that.

Any suggestions? Thanks

Tags: python json large-files

8 answers

人间绝色

#2 · 2019-01-02 15:45

You mention running out of memory, so I have to ask whether you are actually managing memory. Are you using the "del" statement to remove your old object before trying to read a new one? Python should never silently retain an object in memory once nothing references it any more.
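To make that concrete, here is a minimal sketch of what "del" does and does not do. It assumes CPython, where reference counting frees an object as soon as its last reference disappears; the `Document` class and the function names are made up for the illustration (a plain dict cannot be weakly referenced, so a trivial subclass is used as a probe).

```python
import json
import weakref


class Document(dict):
    """Trivial dict subclass, used only so weakref can observe the object."""


def load_document(text):
    # Stand-in for json.load() on a large file.
    return Document(json.loads(text))


def freed_after_del(keep_extra_reference):
    """Return True if the parsed document is gone right after `del doc`."""
    doc = load_document('{"items": [1, 2, 3]}')
    probe = weakref.ref(doc)  # does not keep the object alive
    extra = doc if keep_extra_reference else None  # a forgotten second reference
    del doc  # removes one name, not necessarily the object
    return probe() is None
```

On CPython, `freed_after_del(False)` reports that the object is gone, while `freed_after_del(True)` shows it surviving because `extra` still points at it. So "del" only helps if no other name, list, or cache is still holding the parsed data.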


谁念西风独自凉

#3 · 2019-01-02 15:46

Short answer: no.

Properly dividing a json file would take intimate knowledge of the json object graph to get right.

However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.

For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.

You would have to do some string content parsing to get the chunking of the json file right.
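As a rough sketch of the "single array of objects" case, the generator below reads the file in chunks and uses the standard library's `json.JSONDecoder.raw_decode` to pull out one array element at a time. The name `iter_array_items` and the `chunk_size` default are my own choices, not something from the question. It is deliberately lenient (it skips stray commas) and it re-parses an element that spans several chunks, so treat it as a starting point rather than a hardened parser.

```python
import json

_WS = " \t\r\n"


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array from a text file object."""
    decoder = json.JSONDecoder()
    buf, pos, eof = "", 0, False

    def read_more():
        nonlocal buf, pos, eof
        chunk = fp.read(chunk_size)
        if not chunk:
            eof = True
        buf = buf[pos:] + chunk  # drop text that was already consumed
        pos = 0

    def skip(chars):
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in chars:
                pos += 1
            if pos < len(buf) or eof:
                return
            read_more()

    skip(_WS)
    if pos >= len(buf) or buf[pos] != "[":
        raise ValueError("expected a top-level JSON array")
    pos += 1

    while True:
        skip(_WS + ",")
        if pos >= len(buf):
            raise ValueError("unexpected end of input inside array")
        if buf[pos] == "]":
            return
        try:
            item, end = decoder.raw_decode(buf, pos)
        except json.JSONDecodeError:
            if eof:
                raise
            read_more()  # element is probably cut off; fetch more text
            continue
        if end == len(buf) or buf[end] not in _WS + ",]":
            # No delimiter after the value yet: it may be cut off at the
            # chunk boundary (for example "1e" + "5"), so read more and retry.
            if eof:
                raise ValueError("malformed or truncated JSON array")
            read_more()
            continue
        pos = end
        yield item
```

Usage would look like `with open(path, encoding="utf-8") as fp: for record in iter_array_items(fp): ...`, so only one element plus the current chunk is held in memory at a time.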

I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.


何处买醉

#4 · 2019-01-02 15:51

"the garbage collector should free the memory"

Correct.

Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.

Remove all global variables.

Make all module-level code into smaller functions.


谁念西风独自凉

#5 · 2019-01-02 15:53

Yes.

You can use jsonstreamer, a SAX-like push parser that I have written, which lets you parse arbitrarily sized chunks. You can get it here; check out the README for examples. It's fast because it uses the C yajl library.


骚的不知所云

#6 · 2019-01-02 15:59

So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:

1. Modularize your code. Do something like:

```
for json_file in list_of_files:
    process_file(json_file)
```

If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

2. Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
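Both suggestions could be sketched roughly as follows. The "work" here is just counting top-level items, which is a placeholder for whatever you really do with each file, and `process_file_in_child` and `CHILD_CODE` are hypothetical names. I use `subprocess.run` (a convenience wrapper around `subprocess.Popen`) and pass the child program inline with `-c`; in practice you would more likely point it at your own script.

```python
import json
import subprocess
import sys


def process_file(path):
    """Load one file, reduce it to a small result, and return only that."""
    with open(path, encoding="utf-8") as fp:
        data = json.load(fp)
    # `data` is local, so it becomes unreachable when the function returns.
    return len(data)


# Program run by the child interpreter; sys.argv[1] is the file path.
CHILD_CODE = """
import json, sys
with open(sys.argv[1], encoding="utf-8") as fp:
    data = json.load(fp)
print(json.dumps({"items": len(data)}))
"""


def process_file_in_child(path):
    """Do the same work in a separate interpreter process."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    # Only the small JSON summary crosses the process boundary.
    return json.loads(result.stdout)["items"]
```

With the second variant, all memory used for parsing belongs to the child and is handed back to the operating system when it exits, whatever the parent is still holding on to.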

Hope this helps.


呛了眼睛熬了心

#7 · 2019-01-02 15:59

In addition to @codeape's answer:

I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out only the key names, and so on. Build a hierarchical tree and decide for yourself how to chunk it. That way you can do what @codeape suggests and break the file up into smaller chunks.
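A quick way to get such a tree, sketched here with the standard `json` module instead of a custom parser, is to load a sample you can afford to hold in memory (a truncated copy of the data, or the full file once on a bigger machine) and print only keys, types and sizes. The `outline` function is a hypothetical helper, and it assumes arrays are homogeneous, so it only looks at the first element of each one.

```python
import json


def outline(node, name="$", depth=0, max_depth=4):
    """Return indented lines naming keys and container sizes, not values."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}{name}: object ({len(node)} keys)"]
        if depth < max_depth:
            for key, value in node.items():
                lines += outline(value, key, depth + 1, max_depth)
        return lines
    if isinstance(node, list):
        lines = [f"{pad}{name}: array ({len(node)} items)"]
        if node and depth < max_depth:
            # Assumption: elements share one shape, so sample the first only.
            lines += outline(node[0], "[0]", depth + 1, max_depth)
        return lines
    return [f"{pad}{name}: {type(node).__name__}"]


sample = json.loads(
    '{"meta": {"count": 2}, "rows": [{"id": 1, "tags": ["a"]}, {"id": 2, "tags": []}]}'
)
print("\n".join(outline(sample)))
```

A large array near the top of the outline (like `rows` in the sample) is usually the natural place to split the file into chunks.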
