I need to run a lot of successive queries for specific time spans on time series data in an HDF5 database (the data is stored at one-second resolution and is not always "continuous"; I only know the start and end time). Therefore, I wonder whether there is a faster solution than my current code, which was inspired by this answer:
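import pandas as pd
from pandas import HDFStore
store = HDFStore(pathToStore)
# every second between the known start and end time
dates = pd.date_range(start=start_date,end=end_date, freq='S')
# read the whole index column, then keep the row numbers whose timestamp falls in the span
index = store.select_column('XAU','index')
ts = store.select('XAU', where=index[index.isin(dates)].index)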
Any comments and suggestions are highly appreciated, thx!
Let's test it!
Generating 1M rows DF:
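The original snippet for this step is not shown here, so the following is only a sketch of what it could look like. The six float columns named 'a' to 'f' and the start date 2000-01-01 are assumptions; the one-second frequency matches the data described in the question.

```python
import numpy as np
import pandas as pd

N_ROWS = 10**6  # 1M rows, one per second

# Assumed layout: six random float columns on a one-second DatetimeIndex.
df = pd.DataFrame(
    np.random.rand(N_ROWS, 6),
    columns=list("abcdef"),
    index=pd.date_range("2000-01-01", periods=N_ROWS, freq="s"),
)
print(df.shape)
```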
Let's shuffle it:
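One way to shuffle, assuming a frame like the one described above (rebuilt here so the snippet runs on its own; column names and start date are illustrative), is `DataFrame.sample(frac=1)`, which returns every row in random order:

```python
import numpy as np
import pandas as pd

N_ROWS = 10**6
df = pd.DataFrame(
    np.random.rand(N_ROWS, 6),
    columns=list("abcdef"),  # assumed column names
    index=pd.date_range("2000-01-01", periods=N_ROWS, freq="s"),
)

# frac=1 samples 100% of the rows without replacement, i.e. a full shuffle.
shuffled = df.sample(frac=1)
print(shuffled.index.is_monotonic_increasing)
```

After this the timestamps are no longer in order, which mimics a store whose rows were appended out of sequence.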
Storing generated DF into HDF5 file (NOTE: by default only the index is indexed, so if you are also going to search by other columns, use the data_columns parameter):
Let's test the select(where="<query>") method:
Measuring performance:
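The steps above (store, query, time) could be sketched together as follows. This assumes PyTables is installed; the key 'test', the temporary file path, the column names, and the 11-second query window are all illustrative choices, not taken from the original post. No timings are quoted because they depend on the machine.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10**6
df = pd.DataFrame(
    np.random.rand(N_ROWS, 6),
    columns=list("abcdef"),  # assumed column names
    index=pd.date_range("2000-01-01", periods=N_ROWS, freq="s"),
).sample(frac=1)  # shuffled rows

path = os.path.join(tempfile.mkdtemp(), "test.h5")  # hypothetical location
store = pd.HDFStore(path)

# Table format is needed for where= queries; the index is indexed by default.
# To search by a column too: store.append("test", df, format="table", data_columns=["a"])
store.append("test", df, format="table")

# Only the matching rows are read from disk.
query = "(index >= '2000-01-05') & (index <= '2000-01-05 00:00:10')"
result = store.select("test", where=query)
print(len(result))  # 11 rows: both bounds are inclusive

# Average wall-clock time over a few repetitions.
runs = 10
seconds = timeit.timeit(lambda: store.select("test", where=query), number=runs) / runs
print(f"select(where=...): {seconds * 1e3:.2f} ms per call")

store.close()
```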
Let's compare it with your current approach:
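A sketch of such a comparison, under the same assumptions as before (PyTables installed; key 'test', temporary path, column names and query window are illustrative). `current_approach` mirrors the code from the question; `where_query` pushes the time filter into the store. The snippet prints whatever timings it measures rather than asserting any particular speed-up.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10**6
df = pd.DataFrame(
    np.random.rand(N_ROWS, 6),
    columns=list("abcdef"),  # assumed column names
    index=pd.date_range("2000-01-01", periods=N_ROWS, freq="s"),
).sample(frac=1)  # shuffled rows

path = os.path.join(tempfile.mkdtemp(), "compare.h5")  # hypothetical location
store = pd.HDFStore(path)
store.append("test", df, format="table")

dates = pd.date_range("2000-01-05", "2000-01-05 00:00:10", freq="s")
query = "(index >= '2000-01-05') & (index <= '2000-01-05 00:00:10')"


def current_approach():
    # Question's method: load the full index column, filter in pandas,
    # then select the matching row numbers.
    index = store.select_column("test", "index")
    return store.select("test", where=index[index.isin(dates)].index)


def where_query():
    # Let PyTables evaluate the time filter against the indexed column.
    return store.select("test", where=query)


ts_current = current_approach()
ts_where = where_query()

runs = 5
for func in (current_approach, where_query):
    seconds = timeit.timeit(func, number=runs) / runs
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")

store.close()
```

Both functions should return the same 11 rows, so the only difference being measured is how the rows are located.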
UPDATE: let's do the same test, but this time assuming that the index (time series) is sorted:
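A minimal sketch of the sorted variant, again assuming PyTables and using illustrative names (key 'test_sorted', temporary path, columns 'a' to 'f'). The only change from the shuffled case is that the frame is written in index order; the explicit `sort_index()` is a no-op on this generated frame and is there for data that arrives unsorted.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10**6
df = pd.DataFrame(
    np.random.rand(N_ROWS, 6),
    columns=list("abcdef"),  # assumed column names
    index=pd.date_range("2000-01-01", periods=N_ROWS, freq="s"),
).sort_index()  # rows are written in time order

path = os.path.join(tempfile.mkdtemp(), "sorted.h5")  # hypothetical location
dates = pd.date_range("2000-01-05", "2000-01-05 00:00:10", freq="s")
query = "(index >= '2000-01-05') & (index <= '2000-01-05 00:00:10')"

with pd.HDFStore(path) as store:
    store.append("test_sorted", df, format="table")

    def current_approach():
        # Question's method: full index column plus isin() in pandas.
        index = store.select_column("test_sorted", "index")
        return store.select("test_sorted", where=index[index.isin(dates)].index)

    def where_query():
        # Filter evaluated inside the store.
        return store.select("test_sorted", where=query)

    ts_current = current_approach()
    ts_where = where_query()

    runs = 5
    for func in (current_approach, where_query):
        seconds = timeit.timeit(func, number=runs) / runs
        print(f"{func.__name__} (sorted): {seconds * 1e3:.2f} ms per call")
```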