I have the following code that runs over a large set of data (2M rows). It eats up all of my 4 GB of memory before finishing.
for sample in session.query(CodeSample).yield_per(100):
    for proj in projects:
        if sample.filename.startswith(proj.abs_source):
            sample.filename = "some other path"
            session.add(sample)
Then I ran it over a reduced set of data and analyzed the heap with heapy. get_rp() gave me the following hint:
0: _ --- [-] 47821 (0x9163aec | 0x9165fec | 0x916d6cc | 0x9251414 | 0x925704...
1: a [-] 8244 tuple: 0x903ec8c*37, 0x903fcfc*13, 0x9052ecc*46...
2: aa ---- [S] 3446 types.CodeType: parseresult.py:73:src_path...
3: ab [S] 364 type: __builtin__.Struct, _random.Random, sqlite3.Cache...
4: ac ---- [-] 90 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
5: aca [S] 11 dict of module: ..sql..., codemodel, sqlalchemy
6: acb ---- [-] 48 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
7: acba [S] 9 dict of module: ..sql..., codemodel, sqlalchemy
8: acbb ---- [-] 45 sqlalchemy.sql.visitors.VisitableType: 0x9165fec...
9: acbba [S] 8 dict of module: ..sql..., codemodel, sqlalchemy
I'm new to SQLAlchemy. Is this a memory leak? Thanks.
Most DBAPIs, including psycopg2 and mysql-python, fully load all results into memory before releasing them to the client. SQLA's yield_per() option doesn't work around this, with one exception below, which is why it's generally not a very useful option (edit: "useful" in the sense that it begins streaming results before the actual rows are fully fetched).
The exception to this behavior is psycopg2 used with server-side cursors: executing with the stream_results=True execution option makes the psycopg2 dialect use a named server-side cursor, so rows are streamed from the server instead of being buffered up front.
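A minimal sketch of that combination, assuming a PostgreSQL engine (process is a hypothetical per-row handler, not from the original post):

for sample in (session.query(CodeSample)
               .execution_options(stream_results=True)  # server-side cursor (psycopg2)
               .yield_per(100)):
    process(sample)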
The session will keep track of all the CodeSample objects that you retrieve. So after iterating over 2M objects, the session keeps a reference to all of them. The session needs these references so it can write the correct changes to the database on flush. So I believe what you're seeing is to be expected. To only keep N objects in memory at a time, you could do something like the code below (inspired by this answer; disclaimer: I have not tested it).
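A sketch of that approach (untested, as said): page through the table in fixed-size batches ordered by the primary key, and expunge each batch from the session after flushing it, so only N objects stay referenced at a time. It assumes CodeSample has an integer primary key column named id.

N = 1000
offset = 0
while True:
    batch = (session.query(CodeSample)
             .order_by(CodeSample.id)  # stable ordering for limit/offset paging
             .limit(N)
             .offset(offset)
             .all())
    if not batch:
        break
    for sample in batch:
        for proj in projects:
            if sample.filename.startswith(proj.abs_source):
                sample.filename = "some other path"
    session.flush()        # write this batch's changes to the DB
    session.expunge_all()  # drop references so the batch can be garbage collected
    offset += N
session.commit()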
But the above is a bit clunky; perhaps some SQLAlchemy guru knows a better way to do this.
BTW, you should not need the session.add(); the session tracks changes to the objects. Why do you use yield_per? (EDIT: I guess this is to fetch the rows in chunks from the DB, is that correct? The session will keep track of all of them anyway.)
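To illustrate the change tracking, a minimal sketch (hypothetical setup, not from the original post):

sample = session.query(CodeSample).first()
sample.filename = "some other path"
assert sample in session.dirty  # the session already sees the modification
session.commit()                # flushes the UPDATE; no session.add() needed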
EDIT: Hmm, it looks like there is something I have misunderstood. From the docs:
and