Fast query in formatted data

Posted 2019-05-19 00:48

Question:

In my program I need to query through metadata.

I read the data into a numpy record array A from a csv-like text file **without duplicate rows**:

var1|var2|var3|var4|var5|var6
'a1'|'b1'|'c1'|1.2|2.2|3.4
'a1'|'b1'|'c4'|3.2|6.2|3.2
'a2'|''|'c1'|1.4|5.7|3.8
'a2'|'b1'|'c2'|1.2|2.2|3.4
'a3'|''|'c2'|1.2|2.2|3.4
'a1'|'b2'|'c4'|7.2|6.2|3.2
...

There are millions of rows, and queries inside nested loops can run up to a billion times (mostly matching the first 3 columns), so efficiency is critical.

There are 3 types of queries and the first one is the most frequent.
  • Get rows matching one or more of the first 3 columns with given strings, e.g.,

    • To match a record where var1='a2' and var2='b1',

       ind = np.logical_and(A['var1']=='a2', A['var2']=='b1')
      
    • To match a record where var1='a2', var2='b1' and var3='c1',

       ind = np.logical_and(np.logical_and(A['var1']=='a2', A['var2']=='b1'), A['var3']=='c1')
      

As one can see, each query compares all the elements of the columns against the given strings.

I thought a mapping could be a more efficient way of indexing, so I converted the recarray A to a dict D = {'var1_var2_var3': [var4, var5, var6], ...} and search through the keys with `fnmatch(keys, pat)`. I'm not sure it's a better way.

Or I can make a hierarchical dict {'var1': {'var2': {'var3': [], ...}, ...}, ...} or an in-memory hdf5 layout /var1/var2/var3 and just try to get the item if it exists. Is this the fastest way?
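
For concreteness, a rough sketch of both dict ideas, assuming the record array A described above (all names are illustrative):

    # Tuple-keyed dict: an exact match on (var1, var2, var3) becomes an
    # O(1) average-time lookup instead of a full-column comparison.
    D = {(r['var1'], r['var2'], r['var3']): (r['var4'], r['var5'], r['var6'])
         for r in A}
    D.get(('a2', 'b1', 'c2'))    # -> (1.2, 2.2, 3.4), or None if absent

    # Hierarchical dict: a query on var1 alone (or var1+var2) can stop early.
    H = {}
    for r in A:
        H.setdefault(r['var1'], {}).setdefault(r['var2'], {})[r['var3']] = \
            (r['var4'], r['var5'], r['var6'])
    H['a1']['b1']    # -> {'c1': (1.2, 2.2, 3.4), 'c4': (3.2, 6.2, 3.2)}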

The latter two types of queries are not very frequent, and for them plain numpy recarray comparisons are acceptable.

  • Get all rows where the numeric values in the latter columns fall in a specific range, e.g.,

    • get rows where 1 < var4 < 3 and 0 < var5 < 3,

      ind = np.logical_and(np.logical_and(1<A['var4'], A['var4']<3),
                           np.logical_and(0<A['var5'], A['var5']<3))
      
  • A combination of the above two, e.g.,

    • get rows where var2='b1', 1 < var4 < 3 and 0 < var5 < 3 (a generic helper for all three patterns is sketched below),

      ind = np.logical_and(A['var2']=='b1',
                           np.logical_and(np.logical_and(1<A['var4'], A['var4']<3),
                                          np.logical_and(0<A['var5'], A['var5']<3)))
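
A generic helper covering all three query patterns could look like this (a sketch only; the name `select` and its keyword layout are made up):

    import numpy as np

    def select(A, eq=None, rng=None):
        # AND together equality tests on the string columns and
        # open-interval range tests on the numeric columns.
        mask = np.ones(A.shape[0], dtype=bool)
        for col, val in (eq or {}).items():
            mask &= (A[col] == val)
        for col, (lo, hi) in (rng or {}).items():
            mask &= (lo < A[col]) & (A[col] < hi)
        return A[mask]

    # query type 3 above:
    # select(A, eq={'var2': 'b1'}, rng={'var4': (1, 3), 'var5': (0, 3)})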
      

SQL could be a good way, but it looks too heavy to use a database for this small task. And I don't have the authority to install database support everywhere.

Any suggestions for a data structure for fast in-memory queries? (If a simple custom implementation is hard, sqlite and pandas.DataFrame seem to be possible solutions, as suggested.)

Answer 1:

With your file sample ('b' for py3)

In [51]: txt=b"""var1|var2|var3|var4|var5|var6
    ...: 'a1'|'b1'|'c1'|1.2|2.2|3.4
    ...: 'a1'|'b1'|'c4'|3.2|6.2|3.2
    ...: 'a2'|''|'c1'|1.4|5.7|3.8
    ...: 'a2'|'b1'|'c2'|1.2|2.2|3.4
    ...: 'a3'|''|'c2'|1.2|2.2|3.4
    ...: 'a1'|'b2'|'c4'|7.2|6.2|3.2"""

A simple read leaves me with the double layer of quoting

data = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype=None)

array([(b"'a1'", b"'b1'", b"'c1'", 1.2, 2.2, 3.4), ...
    dtype=[('var1', 'S4'), ('var2', 'S4'), ('var3', 'S4'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])

So I'll define a converter to strip those (a csv reader might do it as well):

def foo(astr):
    # evaluate the quoted field as a Python literal, stripping the extra quotes
    return eval(astr)

In [55]: A = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype='U3,U3,U3,f8,f8,f8', converters={0:foo,1:foo,2:foo})
In [56]: A
Out[56]: 
array([('a1', 'b1', 'c1', 1.2, 2.2, 3.4),
       ('a1', 'b1', 'c4', 3.2, 6.2, 3.2), 
       ('a2', '', 'c1', 1.4, 5.7, 3.8),
       ('a2', 'b1', 'c2', 1.2, 2.2, 3.4), 
       ('a3', '', 'c2', 1.2, 2.2, 3.4),
       ('a1', 'b2', 'c4', 7.2, 6.2, 3.2)], 
      dtype=[('var1', '<U3'), ('var2', '<U3'), ('var3', '<U3'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])

and I can write tests like

In [57]: (A['var1']=='a2')&(A['var2']=='b1')
Out[57]: array([False, False, False,  True, False, False], dtype=bool)
In [58]: (1<A['var4'])&(A['var4']<3)
Out[58]: array([ True, False,  True,  True,  True, False], dtype=bool)

The tests over all records of A are done in compiled numpy code, so they shouldn't be that slow.

This data could also be viewed as 2 multicolumn fields

In [59]: dt = np.dtype([('labels', '<U3', (3,)), ('data', '<f8', (3,))])
In [60]: A1 = A.view(dt)
In [61]: A1
Out[61]: 
array([(['a1', 'b1', 'c1'], [1.2, 2.2, 3.4]),
       (['a1', 'b1', 'c4'], [3.2, 6.2, 3.2]),
       (['a2', '', 'c1'], [1.4, 5.7, 3.8]),
       (['a2', 'b1', 'c2'], [1.2, 2.2, 3.4]),
       (['a3', '', 'c2'], [1.2, 2.2, 3.4]),
       (['a1', 'b2', 'c4'], [7.2, 6.2, 3.2])], 
      dtype=[('labels', '<U3', (3,)), ('data', '<f8', (3,))])

Or loaded directly with

A = np.genfromtxt(txt.splitlines(), skip_header=1, delimiter='|', dtype='(3)U3,(3)f8', converters={0:foo,1:foo,2:foo})

Then tests could be written as:

In [64]: (A1['labels'][:,0]=='a1') & (A1['labels'][:,1]=='b2') & ((A1['data']<6).any(axis=1))
Out[64]: array([False, False, False, False, False,  True], dtype=bool)

In [65]: (A1['labels'][:,[0,1]]==['a1','b2']).all(axis=1)
Out[65]: array([False, False, False, False, False,  True], dtype=bool)

Sometimes it might be clearer to give individual columns their own id:

var1 = A1['labels'][:,0]   # or A['var1']
....
(var1=='a1') & (var2=='b1') & ...

Repeated queries, or combinations of them, could be saved and reused.
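
For instance, a sketch of caching the per-column masks (the cache layout here is just an illustration):

mask_cache = {}

def eq_mask(A, col, val):
    # compute each (column, value) comparison once and reuse it
    key = (col, val)
    if key not in mask_cache:
        mask_cache[key] = A[col] == val
    return mask_cache[key]

ind = eq_mask(A, 'var1', 'a2') & eq_mask(A, 'var2', 'b1')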

I believe pandas stores its series in numpy arrays, with a different dtype for each column (and object dtype if the types vary within a column). But I haven't seen much discussion of pandas speed and speed tricks. I don't expect a big speed improvement unless it provides some sort of indexing.

I can imagine writing this data to a database. sqlite3 is builtin and has a memory mode so you don't need file access. But I'm sufficiently out of practice with that code that I'll pass on demonstrating it. Nor do I have a sense of how easy or fast it is to do these sorts of queries.
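
For what it's worth, a minimal sketch of that route (the table and column names are just illustrative):

import sqlite3

con = sqlite3.connect(':memory:')     # in-memory mode, no file access
cur = con.cursor()
cur.execute('create table data (var1 text, var2 text, var3 text, '
            'var4 real, var5 real, var6 real)')
# .item() turns each structured-array row into a tuple of plain
# Python values, which sqlite3 accepts directly
cur.executemany('insert into data values (?,?,?,?,?,?)',
                [row.item() for row in A])
con.commit()

cur.execute('select * from data where var1=? and var2=?', ('a2', 'b1'))
cur.fetchall()    # [('a2', 'b1', 'c2', 1.2, 2.2, 3.4)]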

https://mail.scipy.org/pipermail/scipy-user/2007-August/013350.html has some code that can save a structured array to a sqlite3 database. It includes a function that converts a dtype into a table creation statement.
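
In the same spirit (a sketch, not that recipe's actual code), the dtype-to-SQL mapping can be as simple as:

def create_stmt(table, dt):
    # map numpy dtype kinds to sqlite column types
    sqltype = {'U': 'text', 'S': 'text', 'f': 'real', 'i': 'integer'}
    cols = ', '.join('%s %s' % (name, sqltype[dt[name].kind])
                     for name in dt.names)
    return 'create table %s (%s)' % (table, cols)

create_stmt('data', A.dtype)
# -> 'create table data (var1 text, var2 text, var3 text, var4 real, var5 real, var6 real)'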

====================

I've got that pipermail example working with python3. The test example has 11 fields. With 5000 records,

data[np.where(data['id']=='id2000')]

is 6x faster than a corresponding sqlite3 query (with an existing cursor):

cursor.execute('select * from data where id=?',('id2000',))
cursor.fetchone()


Answer 2:

Use pandas; it is built for tasks like this:

# Import
import pandas as pd

# Read the pipe-delimited file (note the single-quote quoting)
df = pd.read_csv('/path/to/file.csv', sep='|', quotechar="'")

# Selection criteria
# using `.query` method:
df.query('var1 == "a2" & var3 == "c1"')
df.query('var2 == "b1" & 1 < var4 < 3 & 0 < var5 < 3')
# using indexing:
df[(df['var1'] == 'a2') & (df['var3'] == 'c1')]
df[(df['var2'] == 'b1') & df['var4'].between(1,3) & df['var5'].between(0,3)]
# using `.where` method:
df.where((df['var1'] == 'a2') & (df['var3'] == 'c1'))
df.where((df['var2'] == 'b1') & df['var4'].between(1,3) & df['var5'].between(0,3))
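
Since most queries match the first three columns, a sorted MultiIndex can also speed up repeated exact lookups (a minimal sketch):

# index on the three label columns and sort, so .loc uses the index
# rather than scanning every row
dfi = df.set_index(['var1', 'var2', 'var3']).sort_index()
dfi.loc[('a2', 'b1')]          # all rows with var1='a2', var2='b1'
dfi.loc[('a2', 'b1', 'c2')]    # exact match on all three columns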

More on indexing and selecting data can be found in the pandas documentation.