I am trying to populate a MS SQL 2005 database using Python on Windows. I am inserting millions of rows, and by 7 million rows I am using almost a gigabyte of memory. The test below eats up 4 MB of RAM for each 100k rows inserted:
import pyodbc
connection=pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')
cursor=connection.cursor()
connection.autocommit=True
while 1:
    cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", 1, 2, 3, 4, 5, 6)
connection.close()
Hack solution: I ended up spawning a new process using the multiprocessing module, so the memory is returned to the OS when the process exits. Still confused about why inserting rows in this way consumes so much memory. Any ideas?
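A minimal sketch of that hack, using the same table and connection string as the test above; the BATCH size and the batched() helper are my own additions. The point is that whatever pyodbc accumulates dies with the worker process:

```python
# Sketch: run each batch of inserts in a short-lived worker process so any
# memory pyodbc accumulates is returned to the OS when the process exits.
from multiprocessing import Process

BATCH = 100_000  # rows per worker process (tuning assumption)

def batched(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_batch(rows):
    import pyodbc  # imported in the child so each worker owns its connection
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')
    conn.autocommit = True
    cursor = conn.cursor()
    for row in rows:
        cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", *row)
    conn.close()

# Usage sketch (needs a reachable SQL Server, so shown as comments):
# all_rows = [(1, 2, 3, 4, 5, 6)] * 1_000_000
# for chunk in batched(all_rows, BATCH):
#     p = Process(target=insert_batch, args=(chunk,))
#     p.start()
#     p.join()  # leaked memory is reclaimed when the worker exits
```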
I had the same issue, and it looks like a pyodbc issue with parameterized inserts: http://code.google.com/p/pyodbc/issues/detail?id=145
As a temporary workaround, switching to a static INSERT with the VALUES clause populated inline (no parameters) eliminates the leak, at least until I can try a build from the current source.
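For reference, a sketch of that workaround: building the VALUES clause as a literal string so no parameters are involved. The helper names are mine, and note that this reintroduces the quoting burden (and injection risk) that parameterized queries normally remove, so it only suits trusted data:

```python
def literal(v):
    """Tiny SQL literal formatter for numbers and strings (assumption:
    trusted data only; single quotes are doubled, nothing else is escaped)."""
    if isinstance(v, (int, float)):
        return str(v)
    return "'" + str(v).replace("'", "''") + "'"

def static_insert(table, columns, row):
    """Build a parameter-free INSERT statement for one row."""
    cols = ", ".join(columns)
    vals = ", ".join(literal(v) for v in row)
    return f"insert into {table} ({cols}) VALUES ({vals})"

# Usage sketch:
# cursor.execute(static_insert("x", ["a", "b", "c", "d", "e", "f"],
#                              (1, 2, 3, 4, 5, 6)))
```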
Try creating a separate cursor for each insert. Reuse the cursor variable each time through the loop to implicitly dereference the previous cursor. Call connection.commit() after each insert.
You may only need something as simple as a time.sleep(0) at the bottom of each loop to allow the garbage collector to run.
You could also try forcing a garbage collection every once in a while with gc.collect() after importing the gc module.

Another option might be to use cursor.executemany() and see if that clears up the problem. The nasty thing about executemany(), though, is that it takes a sequence rather than an iterator (so you can't pass it a generator). I'd try the garbage collector first.

EDIT: I just tested the code you posted, and I am not seeing the same issue. Are you using an old version of pyodbc?
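Since executemany() wants a sequence, one way to feed it from a generator is to materialize fixed-size batches; the chunks() helper below is my own sketch:

```python
from itertools import islice

def chunks(iterable, size):
    """Turn any iterator into a stream of lists of at most `size` rows,
    since executemany() takes a sequence rather than an iterator."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage against pyodbc (sketch; row_generator is a placeholder):
# for batch in chunks(row_generator, 10_000):
#     cursor.executemany(
#         "insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", batch)
```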
Maybe close & re-open the connection every million rows or so?
Sure, it doesn't solve anything, but if you only have to do this once you could get on with life!
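A sketch of that approach, written against a generic DB-API connection factory so the reconnect logic is visible; insert_with_reconnect and the every parameter are my naming, and with pyodbc the factory would be lambda: pyodbc.connect(...):

```python
def insert_with_reconnect(connect, sql, rows, every=1_000_000):
    """Close and re-open the connection every `every` rows so that any
    per-connection memory is released. `connect` is a zero-argument
    factory returning a DB-API connection."""
    conn = connect()
    cursor = conn.cursor()
    for i, row in enumerate(rows, 1):
        cursor.execute(sql, row)
        if i % every == 0:
            conn.commit()
            conn.close()       # hand any accumulated memory back
            conn = connect()   # carry on with a fresh connection
            cursor = conn.cursor()
    conn.commit()
    conn.close()
```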
I faced the same problem too.
I had to read more than 50 XML files, each about 300 MB, and load them into SQL Server 2005.
I tried the following:
Using the same cursor by dereferencing.
Closing/opening the connection.
Setting the connection to None.
Finally, I ended up bootstrapping each XML file load using the Process module.
Now I have replaced that approach with IronPython - System.Data.SqlClient.
This gives better performance and also a better interface.