With this table:
CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)
the following code takes 40 seconds to run:
import pyodbc
from datetime import date

conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)

cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()
The equivalent code with psycopg2 only takes 3 seconds. I don't think mssql is that much slower than postgresql. Any idea on how to improve the bulk insert speed when using pyodbc?
EDIT: Adding some notes following ghoerz's discovery.
In pyodbc, the flow of executemany is:
- prepare statement
- loop for each set of parameters
  - bind the set of parameters
  - execute
In ceODBC, the flow of executemany is:
- prepare statement
- bind all parameters
- execute
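For what it's worth, here is a minimal timing sketch (reusing the question's table and placeholder credentials, so adjust driver/server/login for your environment) that makes the pyodbc flow visible: because executemany binds and executes one parameter set at a time, it should land in roughly the same ballpark as an explicit per-row loop.

import time
import pyodbc
from datetime import date

conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()
rows = [(1, 'abc', date.today())] * 10000
sql = 'INSERT INTO test_insert VALUES (?, ?, ?)'

start = time.time()
cursor.executemany(sql, rows)   # binds and executes one parameter set per row
conn.commit()
print('executemany: %.1fs' % (time.time() - start))

start = time.time()
for r in rows:
    cursor.execute(sql, r)      # explicit per-row loop for comparison
conn.commit()
print('execute loop: %.1fs' % (time.time() - start))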
I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parameterized INSERT statement, and executing it for one row. Then it would unprepare the statement and close the connection, repeating this process for every row.
I wasn't able to find any solution in pyODBC that avoided this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parameterized statements correctly.
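For reference, the ceODBC version is close to a drop-in replacement since it follows the same DB-API interface. This is only a sketch, assuming ceODBC accepts an ODBC connection string the way pyodbc does; check its documentation for the exact connect signature.

import ceODBC
from datetime import date

# Assumption: ceODBC.connect() takes an ODBC connection string; adjust if your version differs.
conn = ceODBC.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()

rows = [(1, 'abc', date.today())] * 10000
# ceODBC binds all parameter sets before executing, so the prepared statement runs once.
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()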
I tried both ceODBC and mxODBC, and both were also painfully slow. I ended up going with an ADODB connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!
import win32com.client

comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)  # 1 = adOpenKeyset, 3 = adLockOptimistic

for f in values:
    rs.AddNew(fldLST, f)  # fldLST: list of field names; f: the matching row values
    rs.Update()
Trying to insert 2M+ rows into MSSQL using pyodbc was taking an absurdly long time compared to bulk operations in PostgreSQL (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.
Many answers here correctly suggest fast_executemany; however, there are some tricks to using it effectively. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time: inserting 10k rows was sub-second, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your inserts and commit after each chunk. I found 5k rows per chunk delivered good performance, but this obviously depends on many factors (the data, the machine, DB config, etc.).
import pyodbc

CHUNK_SIZE = 5000

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):  # use xrange in Python 2, range in Python 3
        yield l[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>,
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False)  # IMPORTANT

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True  # IMPORTANT

params = [tuple(x) for x in df.values]  # df: the source pandas DataFrame prepared earlier

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """
INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)
"""

for chunk in chunks(params, CHUNK_SIZE):  # IMPORTANT
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()
pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
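A minimal sketch of what that looks like, reusing the question's table and placeholder credentials (the driver name will depend on what is installed locally):

import pyodbc
from datetime import date

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter arrays in bulk instead of binding row by row

rows = [(1, 'abc', date.today())] * 10000
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()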
I wrote the data to a text file and then invoked the BCP utility. Much, much quicker: from about 20 to 30 minutes down to a few seconds.
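A rough sketch of that approach, assuming the bcp command-line utility is installed and reusing the question's table with placeholder server/credentials:

import csv
import subprocess
from datetime import date

rows = [(1, 'abc', date.today().isoformat())] * 10000

# Write the rows to a flat file that bcp can consume.
with open('rows.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# -c: character data, -t,: comma field terminator
subprocess.run(
    ['bcp', 'test.dbo.test_insert', 'in', 'rows.csv',
     '-S', 'localhost', '-U', 'xxx', '-P', 'yyy', '-c', '-t,'],
    check=True,
)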
I was using pypyODBC with Python 3.5 and Microsoft SQL Server Management Studio.
A particular table (~70K rows with 40 fields) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.
With ceODBC it took 4 seconds.