Python json memory bloat

Posted 2019-05-01 04:08

Question:

import json
import time
from itertools import count

def keygen(size):
    for i in count(1):
        s = str(i)
        yield '0' * (size - len(s)) + s  # left-pad with zeros to a fixed width

def jsontest(num):
    keys = keygen(20)
    kvjson = json.dumps(dict((keys.next(), '0' * 200) for i in range(num)))
    kvpairs = json.loads(kvjson)
    del kvpairs  # Not required; just to check whether it makes any difference
    print 'load completed'

jsontest(500000)

while 1:
    time.sleep(1)

Linux top indicates that the Python process holds ~450 MB of RAM after the 'jsontest' function completes. If the call to 'json.loads' is omitted, the issue is not observed. Calling gc.collect() after the function returns does release the memory.

It looks like the memory is not held in any cache or in Python's internal memory allocator, since an explicit call to gc.collect() releases it.

Is this happening because the threshold for garbage collection (700, 10, 10) was never reached?

I added some code after jsontest to simulate reaching the threshold, but it didn't help.

Answer 1:

Put this at the top of your program:

import gc
gc.set_debug(gc.DEBUG_STATS)

and you'll get printed output whenever there's a collection. You'll see that in your example code there is no collection after jsontest completes, until the program exits.

You can add

print gc.get_count()

to see the current counts. The first number is the excess of allocations over deallocations since the last collection of generation 0; the second (resp. third) is the number of times generation 0 (resp. 1) has been collected since the last collection of generation 1 (resp. 2). If you print these immediately after jsontest completes you'll see that the counts are (548, 6, 0) or something similar (no doubt this varies according to Python version). So the threshold was not reached and no collection took place.
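To make the comparison between counts and thresholds concrete, here is a minimal sketch. It uses Python 3 syntax, whereas the code in the question is Python 2. The helper name collection_status is hypothetical, and the "due" flag is a simplification of how CPython decides to collect generation 0; the exact trigger and the meaning of the counts can differ between Python versions.

```python
import gc

def collection_status():
    """Hypothetical helper: return (counts, thresholds, due) for the collector."""
    counts = gc.get_count()          # current counts for generations 0, 1, 2
    thresholds = gc.get_threshold()  # (700, 10, 10) by default in the question's setup
    # Simplified assumption: generation 0 is collected once its count
    # exceeds its threshold.
    due = counts[0] > thresholds[0]
    return counts, thresholds, due

counts, thresholds, due = collection_status()
print("counts:", counts)
print("thresholds:", thresholds)
print("generation 0 collection due:", due)
```

Calling this right after jsontest returns shows whether the generation 0 count is still under its threshold, which is the situation described above.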

This is typical behaviour for threshold-based garbage collection scheduling. If you need free memory to be returned to the operating system in a timely manner, then you need to combine threshold-based scheduling with time-based scheduling (that is, request another collection after a certain amount of time has passed since the last collection, even if the threshold has not been reached).
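One way to add time-based scheduling is sketched below, in Python 3 syntax. The TimedCollector class and its max_interval parameter are hypothetical names introduced for illustration, not part of the gc module. The sketch tracks only the collections it requests itself; it does not know about automatic collections triggered by the thresholds, so it may collect slightly more often than strictly necessary.

```python
import gc
import time

class TimedCollector:
    """Hypothetical helper: request a full collection at most every max_interval seconds."""

    def __init__(self, max_interval):
        self.max_interval = max_interval       # seconds between forced collections
        self.last_collect = time.monotonic()   # time of the last explicit collection

    def maybe_collect(self, now=None):
        """Call gc.collect() if max_interval has elapsed; return True if it ran."""
        if now is None:
            now = time.monotonic()
        if now - self.last_collect < self.max_interval:
            return False
        gc.collect()                           # full collection of all generations
        self.last_collect = now
        return True

# Stand-in for the idle loop at the end of the question's program,
# bounded here so the example terminates.
collector = TimedCollector(max_interval=0.02)
for _ in range(3):
    time.sleep(0.01)
    collector.maybe_collect()
```

In the question's program, the call to maybe_collect would go inside the final `while 1:` loop next to `time.sleep(1)`, with a max_interval chosen to suit how quickly the memory needs to be released.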