How to match multiple columns in pandas DataFrame

2020-02-28 19:41发布

问题:

I have the following pandas DataFrame:

import pandas as pd
df = pd.DataFrame('filename.csv')
print(df)

order    start    end    value    
1        1342    1357    category1
1        1459    1489    category7
1        1572    1601    category23
1        1587    1599    category2
1        1591    1639    category1
....
15        792     813    category13
15        892     913    category5
....

So, there is an order column encompasses many rows each, and then a range/interval from start to end for each row. Each row then is labeled by a certain value (e.g. category1, category2, etc.)

Now I have another dataframe called key_df. It is basically the exact same format:

import pandas as pd
key_df = pd.DataFrame(...)
print(key_df)

order    start    end    value    
1        1284    1299    category4
1        1297    1309    category9
1        1312    1369    category3
1        1345    1392    category29
1        1371    1383    category31
....
1        1471    1501    category31
...

My goal is to take the key_df dataframe and check whether the intervals start:end match any of the rows in the original dataframe df. If it does, this row in df should be labeled with the key_df dataframe's value value.

In our example above, the dataframe df would end up like this:

order    start    end    value        key_value
1        1342    1357    category1    category29
1        1459    1489    category7    category31
....

This is because if you look at key_df, the row

1        1345    1392    category29

with interval 1::1345-1392 falls in the interval 1::1342-1357 in the original df. Likewise, the key_df row:

1        1471    1501    category31

corresponds to the second row in df:

1        1459    1489    category7    category31

I'm not entirely sure

(1) how to accomplish this task in pandas

(2) how to scale this efficiently in pandas

One could begin with an if statement, e.g.

if df.order == key_df.order:
    # now check intervals...somehow

but this doesn't take advantage of the dataframe structure. One then must check by interval, i.e. something like (df.start =< key_df.start) && (df.end => key_df.end)

I'm stuck. What is the most efficient way to match multiple columns in an "interval" in pandas? (Creating a new column if this condition is met is then straightforward)

回答1:

You can use merge with boolean indexing, but if DataFrames are large, scaling is problematic:

df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print (df1)
    order  start   end      value  start_key  end_key   value_key
3       1   1342  1357  category1     1345.0   1392.0  category29
4       1   1342  1357  category1     1371.0   1383.0  category31
5       1   1342  1357  category1     1471.0   1501.0  category31
11      1   1459  1489  category7     1471.0   1501.0  category31

EDIT by comment:

df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
df1 = pd.merge(df, df1, on=['order','start','end', 'value'], how='left')
print (df1)
   order  start   end       value  start_key  end_key   value_key
0      1   1342  1357   category1     1345.0   1392.0  category29
1      1   1342  1357   category1     1371.0   1383.0  category31
2      1   1342  1357   category1     1471.0   1501.0  category31
3      1   1459  1489   category7     1471.0   1501.0  category31
4      1   1572  1601  category23        NaN      NaN         NaN
5      1   1587  1599   category2        NaN      NaN         NaN
6      1   1591  1639   category1        NaN      NaN         NaN
7     15    792   813  category13        NaN      NaN         NaN
8     15    892   913   category5        NaN      NaN         NaN