Pandas dataframe and speed

2019-09-07 16:30发布

问题:

I have a pandas dataframe object which I have preallocated with 400 000 entries. 2 columns a timestamp of type datetime.datetime and a float number. When I attempt to insert (overwrite) a row in the table it seems rather slow, depending on the size of the table I get something like 0.044seconds. I have created an index of integer and I am using this index to access the row. Here is how I am using it:

maxsize = 400000
data = pd.DataFrame({'ts' : date_list, 'val' : zeros}, index=range(maxsize))
# later on, the next statement is "slow"
data.iloc[0] = pd.Series({'ts' : datetime.datetime.now(), 'val': val})

As per me investigation, the last statement takes about 0.044seconds on my machine (i7-4650U). This seems quite slow. Is there something that I am doing fundamentally wrong ? Could I use something like a HDF Store to improve write speeds, but keep high read speeds also ?

Thanks.

回答1:

You are assigning a series that is object dtype, iow, its mixed. So when the element assignment happens, the datetime needs to get converted. All of this is cheap; what is expensive are that each column needs to be internally copied to guard against dtype changes. Their is a fair amount of validation in the assignments to handle lots of edge cases.

In [23]: data = pd.DataFrame({'ts' : pd.date_range('20130101',freq='s',periods=maxsize), 'val' : 0}, index=range(maxsize))

In [24]: s = Series({'ts' : datetime.datetime.now(), 'val' : 1 })

In [25]: %timeit data.iloc[-1] = s
100 loops, best of 3: 10.6 ms per loop

You can bypass a lot of this, but do an item-by-item assignment. This is quite fast, but you have to make sure that your dtypes are compatible.

In [26]: def f():
    data.iat[-1,0] = s['ts']
    data.iat[-1,1] = s['val']
   ....:     

In [27]: data.tail()              
Out[27]: 
                               ts  val
399995        2013-01-05 15:06:35    0
399996        2013-01-05 15:06:36    0
399997        2013-01-05 15:06:37    0
399998        2013-01-05 15:06:38    0
399999 2015-02-24 06:03:58.344166    1

In [28]: %timeit f()
10000 loops, best of 3: 35.2 us per loop


回答2:

I think your solution is more a process than a programming one. Why use Python as a data storage handler since you worry about performance? Essentially, Python would resemble a client that interacts with data pulled from an external source, namely a dedicated database like MySQL or SQLite (using ODBC/OLEDB).

So, why not structure your dataset (append rows, update records, select columns) using an indexed, relational, SQL-engine database beforehand, then import into Python data frames for analytical/graphical purposes? Examples include:

DATABASE CONNECTION

conn = sqlite3.connect("databasename.sqlite")
df = pd.read_sql("SELECT [field1], [field2] from datatable", conn)
df

APPEND ROWS

conn = sqlite3.connect('databasename.sqlite')
cur = conn.cursor()
sql =  "INSERT INTO datatable (field1, field2, field3) VALUES ('{0}','{1}','{2}');".format(items[0], items[1], items[2])

cur.execute(sql)
db.commit()

CSV EXPORT/IMPORT

conn = sqlite3.connect('databasename.sqlite')
cur = conn.cursor()
cur.execute("SELECT [field1], [field2] from datatable")

a = csv.writer(open('datafile.csv', 'w', newline=''))
for row in cur.fetchall() :
    a.writerow(row)

filepath = 'datafile.csv'  # OUTPUTTED PRIOR FROM DATABASE
tp = pd.io.parsers.read_csv(filepath, sep=',', iterator=True, chunksize=1000, encoding = "ISO-8859-1")
finaldf = pd.concat(list(tp), ignore_index=True)