To manage the amount of RAM I consume in doing an analysis, I have a large dataset stored in hdf5 (.h5) and I need to query this dataset efficiently using Pandas.
The data set contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 possible, and then filter the resulting dataframe to only those users who are using a one of a few apps that interest me.
# list of apps I want to analyze
apps = ['a','d','f']
# Users.h5 contains only one field_table called 'df'
store = pd.HDFStore('Users.h5')
# the following query works fine
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1'])
# the following pseudo-query fails
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1', 'app in apps'])
I realize that the string 'app in apps' is not what I want. This is simply a SQL-like representation of what I hope to achieve. I cant seem to pass a list of strings in any way that I try, but there must be a way.
For now I am simply running the query without this parameter and then I filter out the apps I don't want in a subsequent step thusly
df = df[df['app'].isin(apps)]
But this is much less efficient since ALL of the apps need to first be loaded into memory before I can remove them. In some cases, this is big problem because I don't have enough memory to support the whole unfiltered df.