My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an hdf5 file), it takes a very long time to complete the query. For example:
I have a df that contains 2*10^9 rows. It is stored in an HDF5 file. I have a string column named "code", that was marked as "data_column" (therefore it is indexed).
When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows.
However, when I search for a code that does NOT exist in the dataset ( store.select('df', 'code=not_valid_code') ) it takes around 980 seconds to get the result of the query (0 rows).
I create the store like: store = pd.HDFStore('data.h5', complevel=1, complib='zlib') And the first append is like: store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)
Is this behavior normal or is there something wrong going on?
Thanks!
PS: this question is probably related with this other question
UPDATE:
Following Jeff's advice, I replicated his experiment, and I got the following results on a Mac. This is the table that was generated:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(50000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (8192,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string64',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 50000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
And these are the results:
In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop
In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop
In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop
In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop
Since the differences were maybe not big enough, I tried the same experiment but with a bigger dataframe. Also, I did this experiment on a different machine, one with Linux.
This is the code (I just multiplied the original dataset by 10):
import pandas as pd
df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})
df = pd.concat([ df ] * 20)
store = pd.HDFStore('test.h5',mode='w')
for i in range(50):
print "%s" % i
store.append('df',df,data_columns=['A'])
This is the table:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(500000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (15420,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string72',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 500000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
These are the files:
-rw-rw-r-- 1 user user 8.2G Oct 5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct 5 14:30 test_zlib.h5
And these are the results:
In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop
In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop
In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop
In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop
These are my versions of Pandas and Pytables:
user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires:
user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy
Although I am quite sure that the issue is not related with Pandas, since I have observed similar behavior when using only Pytables without Pandas.
UPDATE 2:
I have switched to Pytables 3.0.0 and the problem got fixed. This is using the same files that were generated with Pytables 3.1.1.
In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop
In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop