Looping over np.load calls on npz files causes a memory overflow (its size depends on the length of the file list).
None of the following seems to help:
Deleting the variable that stores the data loaded from the file.
Using mmap.
Calling gc.collect() (forcing garbage collection).
The following code should reproduce the phenomenon:
import numpy as np

# generate a file for the demo
X = np.random.randn(1000, 1000)
np.savez('tmp.npz', X=X)

# here comes the overflow:
for i in xrange(1000000):
    data = np.load('tmp.npz')
    data.close()  # avoid the "too many files are open" error
In my real application the loop is over a list of files, and the overflow exceeds 24GB of RAM! Please note that this was tried on Ubuntu 11.10, with both numpy 1.5.1 and 1.6.0.
I have filed a report as numpy ticket 2048, but this may be of wider interest, so I am posting it here as well (moreover, I am not sure that this is a bug; it may well be the result of my bad programming).
SOLUTION (by HYRY):
The command
del data.f
should precede the command
data.close()
For more information, and for a method of tracking down this kind of leak, please read HYRY's kind answer below.
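For reference, the reproduction loop with the fix applied looks like this; only the del data.f line is new relative to the code above:

import numpy as np

X = np.random.randn(1000, 1000)
np.savez('tmp.npz', X=X)

for i in xrange(1000000):
    data = np.load('tmp.npz')
    del data.f    # break the NpzFile <-> BagObj reference cycle
    data.close()  # avoid the "too many files are open" error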
I think this is a bug, and maybe I found the solution: call "del data.f".
To find this kind of memory leak, you can use the following code:
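A sketch of such a check, assuming the same tmp.npz test file as in the question (this is a reconstruction, not the verbatim snippet from the original answer): it runs the leaking loop a few times and then counts the live objects reported by gc.get_objects() by type name.

import gc
import numpy as np

X = np.random.randn(1000, 1000)
np.savez('tmp.npz', X=X)

# run the leaking loop a modest number of times
for i in xrange(10):
    data = np.load('tmp.npz')
    data.close()

# count the live objects tracked by the garbage collector, by type name
counts = {}
for obj in gc.get_objects():
    name = type(obj).__name__
    counts[name] = counts.get(name, 0) + 1

for name, count in sorted(counts.items(), key=lambda item: item[1]):
    print("%s %d" % (name, count))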
After running the test program, I created a dict and counted the objects in gc.get_objects(). Here is the output:
From the result we know that something is wrong with BagObj and NpzFile. Looking at their code in numpy:
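A condensed paraphrase of the relevant classes in numpy/lib/npyio.py as of numpy 1.5/1.6 (not the verbatim source; the details are trimmed):

class BagObj(object):
    # attribute-style access to the archive's members
    def __init__(self, obj):
        self._obj = obj          # BagObj -> NpzFile reference

class NpzFile(object):
    def __init__(self, fid):
        self.fid = fid
        self.f = BagObj(self)    # NpzFile -> BagObj reference: a cycle
    def close(self):
        self.fid = None
    def __del__(self):
        self.close()             # __del__ on an object that sits inside the cycle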
NpzFile has a __del__() method, NpzFile.f is a BagObj, and BagObj._obj is the NpzFile: this is a reference cycle, and it makes both NpzFile and BagObj uncollectable. There is some explanation in the Python documentation: http://docs.python.org/library/gc.html#gc.garbage
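A minimal, self-contained illustration of that rule (on Python 2, which this report uses, a cyclic object that defines __del__ is never freed and is listed in gc.garbage; Python 3.4+ changed this behaviour):

import gc

class Leaky(object):
    def __del__(self):
        pass                 # any __del__ blocks collection of the cycle on Python 2

x = Leaky()
x.self_ref = x               # a reference cycle through an object with __del__
del x                        # drop our name; only the cycle keeps the object alive

gc.collect()
print(gc.garbage)            # on Python 2 this lists the uncollectable Leaky instance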
So, to break the reference cycle, we need to call "del data.f" before data.close().