Fastest way to load numeric data into python/panda

2019-03-09 18:00发布

问题:

I want to read some numeric (double, i.e. float64) data from a MySQL table. The size of the data is ~200k rows.

MATLAB reference:

tic;
feature accel off;
conn = database(...);
c=fetch(exec(conn,'select x,y from TABLENAME'));
cell2mat(c.data);
toc

Elapsed time is ~1 second.

Doing the same in python, using the several examples found in here (I have tried them all, i.e. using pandas read_frame, frame_query and the __processCursor function): How to convert SQL Query result to PANDAS Data Structure?

Reference python code:

import pyodbc
import pandas.io.sql as psql
import pandas
connection_info = "DRIVER={MySQL ODBC 3.51 \
Driver};SERVER=;DATABASE=;USER=;PASSWORD=;OPTION=3;"
cnxn = pyodbc.connect(connection_info)
cursor = cnxn.cursor()
sql = "select x,y from TABLENAME"
#cursor.execute(sql)
#dataframe = __processCursor(cursor, dataframe=True)
#df = psql.frame_query(sql, cnxn, coerce_float=False)
df = psql.read_frame(sql, cnxn)
cnxn.close()

Takes ~6 seconds. Profiler says all the time spent was in read_frame. I was wondering if anyone could give me some hints how could this be speedup to at least match the MATLAB code. And if that is possible at all in python.

EDIT:

The bottleneck seems to be inside the cursor.execute (in pymysql library) or cursor.fetchall() in pyodbc library. The slowest part is reading the returned MySQL data element by element (row by row, column by column) and converting it to the data type which it inferred previously by the same library.

So far I have managed to speed this up to close to MATLAB by doing this really dirty solution:

import pymysql
import numpy

conn = pymysql.connect(host='', port=, user='', passwd='', db='')
cursor = conn.cursor()
cursor.execute("select x,y from TABLENAME")
rez = cursor.fetchall()
resarray = numpy.array(map(float,rez))
finalres = resarray.reshape((resarray.size/2,2))

The above cur.execute IS NOT THE pymysql EXECUTE! I have modified it, inside the file "connections.py". First, the function def _read_rowdata_packet, now has instead of:

rows.append(self._read_row_from_packet(packet))

substituted with

self._read_string_from_packet(rows,packet)

Here _read_string_from_packet is a simplified version of _read_row_from_packet with the code:

def _read_string_from_packet(self, rows, packet):
    for field in self.fields:
        data = packet.read_length_coded_string()
        rows.append(data)

This is an uber-dirty solution which gives a speedup down from 6 seconds to 2.5 seconds. I was wondering, if all of this could somehow be avoided by using a different library/passing some parameters?

Hence the solution would be to bulk-read the entire MySQL reply to a list of strings and then bulk-type converting to numerical data types, instead of doing that element-by-element. Does something like that already exist in python?

回答1:

The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

instead of decimal.Decimal seems to completely solve the problem and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(),dtype=float)
print(time.time()-t)

Runs in less than a second! What is funny, decimal.Decimal never appeared to be the problem in the profiler.

Similar solution should work in pymysql package. pyodbc is more tricky: it is all written in C++, hence you would have to recompile the entire package.

UPDATE

Here is a solution not requiring to modify the MySQLdb source code: Python MySQLdb returns datetime.date and decimal The solution then to load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='',user='',passwd='',db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

Beats MATLAB by a factor of ~4 in loading 200k x 9 table!



回答2:

Also check out this way of doing things using the turbodbc package. To transform your result set into an OrderedDict of NumPy arrays, just do this:

import turbodbc
connection = turbodbc.connect(dsn="My data source name")
cursor = connection.cursor()
cursor.execute("SELECT 42")
results = cursor.fetchallnumpy()

Transforming these results to a dataset should require a few additional milliseconds. I don't know the speedup for MySQL, but I have seen factor 10 for other databases.

The speedup is mainly achieved by using bulk operations instead of row-wise operations.