I have a pandas HDFStore that I am trying to select from. I would like to select the rows between two timestamps whose id is in a large np.array. The following code works, but it uses too much memory when the query tests for membership in a list. If I query only on a DatetimeIndex range, the memory footprint is 95% lower.
# start_ts and end_ts are timestamps
# instruments is an array of Python objects
not_memory_efficient = adj_data.select("US", [Term("date", ">=", start_ts),
                                              Term("date", "<=", end_ts),
                                              Term("id", "=", instruments)])
memory_efficient = adj_data.select("US", [Term("date", ">=", start_ts),
                                          Term("date", "<=", end_ts)])
Is there a more memory efficient way to do this in HDFStore? Should I set the index to the "sec_id"? (I can also use the chunksize option and concat myself, but that seems to be a bit of a hack.)
Edits:
The HDFStore is created with pd.HDFStore; each DataFrame is built and stored like this (I made a mistake earlier):
def write_data(country_data, store_file):
    for country in country_data:
        if len(country_data[country]) == 0:
            continue
        df = pd.concat(country_data[country], ignore_index=True)
        country_data[country] = []
        store_file.append(country, df, format="t")
As requested, here is the ptdump for this table: https://gist.github.com/MichaelWS/7980846
Also, here is the df: https://gist.github.com/MichaelWS/7981451
You cannot pass a large list as a selection criterion without the entire pandas object being loaded into memory. This is a limitation of how numexpr operates.
pandas issue: https://github.com/pydata/pandas/issues/5717
pytables issue: http://sourceforge.net/mailarchive/message.php?msg_id=30390757
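Until that limitation is lifted, the chunked approach mentioned in the question is one way to bound memory: select on the date range only, then filter each chunk for id membership in pandas. The following is a minimal sketch, not a definitive implementation. The helper name select_ids_in_chunks is hypothetical, and the in-memory chunks only stand in for the iterator that select(..., chunksize=...) returns.

```python
import pandas as pd


def select_ids_in_chunks(chunks, ids, id_column="id"):
    """Filter an iterable of DataFrame chunks down to rows whose id is in ids.

    Hypothetical helper: `chunks` can be any iterable of DataFrames, for
    example the iterator returned by HDFStore.select(..., chunksize=N).
    """
    wanted = list(ids)
    # Only one chunk plus the rows kept so far are held in memory at a time.
    kept = [chunk[chunk[id_column].isin(wanted)] for chunk in chunks]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept)


# Stand-in data: three in-memory chunks of two rows each.
frame = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                      "value": [10, 20, 30, 40, 50, 60]})
chunks = (frame.iloc[i:i + 2] for i in range(0, len(frame), 2))

result = select_ids_in_chunks(chunks, ids=[2, 5, 6])
print(result)
```

With a real store, the chunks argument would be the date-only select from the question with a chunksize added, so the large id list never reaches numexpr.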
To memorialize this for other users:
In HDFStore, you must designate columns as data_columns if they are not the index in order to query on them later.
Docs are here
Create a frame
Save to hdf WITHOUT data_columns
0.13 will report this error (0.12 will just silently ignore)
Set all the columns as data columns (can also be a specific list of columns)
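The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the original example: it assumes a recent pandas with PyTables installed, uses a string where expression instead of Term, and the frame, column names, and file name are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame(
    {
        "id": np.arange(10) % 3,
        "value": np.arange(10, dtype="float64"),
    },
    index=pd.date_range("2013-01-01", periods=10),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")

    # data_columns=True makes every column queryable;
    # a specific list such as data_columns=["id"] also works.
    df.to_hdf(path, key="df", format="table", data_columns=True)

    # Because "id" is a data column, the filter is applied on disk.
    subset = pd.read_hdf(path, key="df", where="id == 1")

print(subset)
```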
Here is the Table node from ptdump -av of the file:
The key thing to note is that the data_columns are separate in the 'description', AND they are set up as indexes.