可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an hdf5 file), it takes a very long time to complete the query. For example:

I have a df that contains 2*10^9 rows. It is stored in an HDF5 file. I have a string column named "code", that was marked as "data_column" (therefore it is indexed).

When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows.

However, when I search for a code that does NOT exist in the dataset ( store.select('df', 'code=not_valid_code') ) it takes around 980 seconds to get the result of the query (0 rows).

I create the store like: store = pd.HDFStore('data.h5', complevel=1, complib='zlib') And the first append is like: store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)

Is this behavior normal or is there something wrong going on?

Thanks!

PS: this question is probably related with this other question

UPDATE:

Following Jeff's advice, I replicated his experiment, and I got the following results on a Mac. This is the table that was generated:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(50000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (8192,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string64',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 50000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

And these are the results:

In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop

In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop

In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop

In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop

Since the differences were maybe not big enough, I tried the same experiment but with a bigger dataframe. Also, I did this experiment on a different machine, one with Linux.

This is the code (I just multiplied the original dataset by 10):

import pandas as pd

df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})

df = pd.concat([ df ] * 20)

store = pd.HDFStore('test.h5',mode='w')

for i in range(50):
    print "%s" % i
    store.append('df',df,data_columns=['A'])

This is the table:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(500000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (15420,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string72',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 500000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

These are the files:

-rw-rw-r-- 1 user user 8.2G Oct  5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct  5 14:30 test_zlib.h5

And these are the results:

In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop

In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop

In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop

In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop

These are my versions of Pandas and Pytables:

user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: 

user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy

Although I am quite sure that the issue is not related with Pandas, since I have observed similar behavior when using only Pytables without Pandas.

UPDATE 2:

I have switched to Pytables 3.0.0 and the problem got fixed. This is using the same files that were generated with Pytables 3.1.1.

In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop

In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop

回答1:

I think your issue is one which we filed a bug a while ago here with the PyTables guys. Essentially, when using a compressed store AND specifying expectedrows AND using an indexed columns causes mis-indexing.

The soln is simply NOT to use expectedrows, and rather to ptrepack the file with a specified chunkshape (or AUTO). This is good practice anyhow. Further, not sure if you specifying compression up-front, but it is IMHO better to do this via ptrepack, see docs here. Their is also an issue on SO about this (can't find it right now, essentially if you are creating the file, don't don't index up-front but when you are done appending, if you can).

In any event, creating a test store:

In [1]: df = DataFrame({'A' : [ 'foo%05d' % i for i in range(50000) ]})

In [2]: df = pd.concat([ df ] * 20)

Append 50M rows.

In [4]: store = pd.HDFStore('test.h5',mode='w')

In [6]: for i in range(50):
   ...:     print "%s" % i
   ...:     store.append('df',df,data_columns=['A'])
   ...:

Here is the table

In [9]: !ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(50000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (8192,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string64',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 50000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

Create a blosc and zlib version.

In [12]: !ptrepack --complib blosc --chunkshape auto --propindexes test.h5 test_blosc.h5

In [13]: !ptrepack --complib zlib --chunkshape auto --propindexes test.h5 test_zlib.h5

In [14]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 866182540 Oct  4 20:31 test.h5
-rw-rw-r-- 1 jreback users 976674013 Oct  4 20:36 test_blosc.h5
-rw-rw-r-- 1 jreback users 976674013 Oct  4  2014 test_zlib.h5

Perf is pretty similar (for the found rows)

In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 337 ms per loop

In [15]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "foo00002"')
1 loops, best of 3: 345 ms per loop

In [16]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 347 ms per loop

And missing rows (though the compressed do perform better here).

In [11]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
10 loops, best of 3: 82.4 ms per loop

In [17]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "bar"')
10 loops, best of 3: 32.2 ms per loop

In [18]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 32.3 ms per loop

So. try w/o the expected rows specifier, and use ptrepack.

Another possiblity if you are expecting a relatively low density of entries for this column (e.g. a smaller number of unique entries). Is to select the entire column, store.select_column('df','A').unique() in this case, and use that as a quick lookup mechanism (so you don't search at all).