I am working on using the HDFStore in pandas to store data frames from an ongoing iterative process. At each iteration, I append to a table in the HDFStore. Here is a toy example:
import pandas as pd
from pandas import HDFStore
import numpy as np
from random import choice
from string import ascii_letters
alphanum = np.array(list(ascii_letters) + [str(d) for d in range(0, 10)])
def hdfstore_append(storefile, key, df, format="t", data_columns=None):
    if df is None:
        return
    if key[0] != '/':
        key = '/' + key
    with HDFStore(storefile) as store:
        if key not in store.keys():
            # first write for this key: create the table
            store.put(key, df, format=format, data_columns=data_columns)
        else:
            try:
                store.append(key, df)
            except Exception as inst:
                # append failed (e.g. a string column outgrew its itemsize),
                # so rebuild the whole table with put
                df = pd.concat([store.get(key), df])
                store.put(key, df, format=format, data_columns=data_columns)

storefile="db.h5"
for i in range(0, 100):
    df = pd.DataFrame([dict(n=np.random.randn(),
                            s=''.join(alphanum[np.random.randint(0, len(alphanum),
                                                                 np.random.randint(1, 2 * (i + 1)))]))],
                      index=[i])
    hdfstore_append(storefile, '/SO/df', df, data_columns=True)
The hdfstore_append function guards against the various exceptions HDFStore.append can throw and rebuilds the table when necessary. The problem with this approach is that it becomes very slow once the table in the store grows large.
Is there a more efficient way to do this?
Below is an example of a more efficient method for building large pandas HDFStores. The key is to cache the frames and write them to the store in batches once the accumulated data grows large. Also, instead of appending, removing the pre-existing data first essentially turns the write into a put.
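Here is a minimal sketch of the batching idea, not the exact code from the original answer: incoming frames are cached in a list and flushed to the store in larger appends, and min_itemsize reserves room for the string column up front so append never has to fall back to a full rebuild. The file name db_batched.h5, the batch_size of 25, and the itemsize of 500 are illustrative choices, not values from the source.

import numpy as np
import pandas as pd
from pandas import HDFStore
from string import ascii_letters

alphanum = np.array(list(ascii_letters) + [str(d) for d in range(0, 10)])

storefile = "db_batched.h5"   # illustrative file name
key = '/SO/df'
batch_size = 25               # illustrative flush threshold
cache = []                    # frames waiting to be written

def flush(cache):
    # Concatenate the cached frames and write them in a single append.
    if not cache:
        return
    batch = pd.concat(cache)
    with HDFStore(storefile) as store:
        # min_itemsize reserves space for long strings up front, so later
        # appends do not fail and force a rewrite of the whole table.
        store.append(key, batch, format="t", data_columns=True,
                     min_itemsize={'s': 500})
    cache.clear()

for i in range(0, 100):
    df = pd.DataFrame([dict(n=np.random.randn(),
                            s=''.join(alphanum[np.random.randint(0, len(alphanum),
                                                                 np.random.randint(1, 2 * (i + 1)))]))],
                      index=[i])
    cache.append(df)
    if len(cache) >= batch_size:
        flush(cache)
flush(cache)   # write out whatever is left in the cache

With this pattern the store is opened and written roughly 100 / batch_size times instead of once per iteration, and because the item size is fixed in advance every write is a cheap append rather than a full rewrite of the table.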