Lookup with Missing Labels

Posted 2019-08-07 15:20

Question:

I have code that uses a DataFrame to look up a value (P) given its column label (X):

import pandas as pd

df_1 = pd.DataFrame({'X': [1,2,3,1,1,2,1,3,2,1]})

df_2 = pd.DataFrame({ 1 : [1,2,3,4,1,2,3,4,1,2],
                      2 : [4,1,2,3,4,1,2,1,2,3],
                      3 : [2,3,4,1,2,3,4,1,2,5]})

df_1['P'] = df_2.lookup(df_1.index, df_1['X'])

When I give it a label in df_1 that isn't a column of df_2, like this:

df_1 = pd.DataFrame({'X': [7,2,3,1,1,2,1,3,2,1]})

I get:

KeyError: 'One or more column labels was not found'

How can I skip those, please, to get:

   X  P
0  7  NaN
1  2  1
2  3  4
3  1  4
4  1  1
5  2  1
6  1  3
7  3  1
8  2  2
9  1  2

Answer 1:

Using get with default values

def get_lu(df):
  def lu(i, j):
    return df.get(j, {}).get(i, np.nan)
  return lu

[*map(get_lu(df_2), df_1.index, df_1.X)]

[nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]
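
The trick is that df_2.get(j, {}) returns column j as a Series when the label exists and falls back to the empty dict otherwise, so the chained .get(i, np.nan) either reads the row value or returns NaN. A minimal illustration, assuming the df_2 defined in the question:

import numpy as np

df_2.get(2, {})                   # column 2 as a Series
df_2.get(7, {})                   # returns {} because 7 is not a column
df_2.get(2, {}).get(0, np.nan)    # 4, the value at row 0, column 2
df_2.get(7, {}).get(0, np.nan)    # nan, the second get falls back as well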

Alternative

[df_2.get(j, {}).get(i, np.nan) for i, j in df_1.X.items()]

[nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]

All together

df_1.assign(P=[df_2.get(j, {}).get(i, np.nan) for i, j in df_1.X.items()])

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0

Uglier version

df_1.assign(P=[df_2.rename_axis('X', axis=1).stack().get(x, np.nan) for x in df_1.X.items()])

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0
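
For reference, stacking turns df_2 into a Series whose index holds (row label, column label) pairs, which is why the (index, value) tuples yielded by df_1.X.items() can be passed straight to get. A small sketch, reusing the frames above:

s = df_2.rename_axis('X', axis=1).stack()   # MultiIndex of (row label, column label)
s.get((0, 1), np.nan)   # 1, the value at row 0, column 1
s.get((0, 7), np.nan)   # nan, column 7 was never stacked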


Answer 2:

Following the documentation, add a try...except:

result = []
for row, col in zip(df_1.index, df_1.X):
    try:
        result.append(df_2.loc[row, col])
    except KeyError:
        result.append(np.nan)

result
Out[135]: [nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]
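
To attach the list to the frame, assign it as a column, for example:

df_1['P'] = result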


Answer 3:

A tad slower than @piRSquared, but using loc + lambda:

>>> df_1['P'] = df_1.apply(lambda x: df_2.loc[x.name, x.values[0]] if x.values[0] in df_2.columns else np.nan, axis=1)
>>> df_1

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0


Answer 4:

This answer uses NumPy and is fast.

import numpy as np
import pandas as pd

Set up the DataFrames:

df_1 = pd.DataFrame({'X': [7,2,3,1,1,2,1,3,2,1]})

df_2 = pd.DataFrame({ 1 : [1,2,3,4,1,2,3,4,1,2],
                      2 : [4,1,2,3,4,1,2,1,2,3],
                      3 : [2,3,4,1,2,3,4,1,2,5]})


# designate working columns
lookup_cols = [1, 2, 3]
key_col = 'X'
result_col = 'P'

# get key column values as an array
key = df_1[key_col].values

# make an array of nans to hold the lookup results
result = np.full(key.shape[0], np.nan)

# create a boolean array containing only valid lookup indexes
b = np.isin(key, lookup_cols)

# filter df_1 and df_2 with boolean array b
df_1b = df_1[b]
df_2b = df_2[b]

# lookup values using filtered dataframes
lup = df_2b.lookup(df_1b.index, df_1b[key_col])
# put the results into the result array at proper index locations using b
result[b] = lup
# assign the result array to the dataframe result column
df_1[result_col] = result
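
After assigning the result back, df_1 should match the desired output (P becomes float because of the NaN):

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0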


Answer 5:

If, rather than the index, I wanted to use another column from df_1, then piRSquared's answer becomes:

df_1 = pd.DataFrame({'M' : ['X','Y','Z','X','Y','F','Y'],
                     'N' : ['A','C','B','B','A','A','F']})

df_2 = pd.DataFrame({'A' : [1,2,3],
                     'B' : [4,1,2],
                     'C' : [2,3,4]},
                     index = ['X', 'Y', 'Z'])

def get_lu(df):
  def lu(i, j):
    return df.get(j, {}).get(i, np.nan)
  return lu

df_1['O'] = [*map(get_lu(df_2), df_1.M, df_1.N)]

Which gives:

   M  N    O
0  X  A  1.0
1  Y  C  3.0
2  Z  B  2.0
3  X  B  4.0
4  Y  A  2.0
5  F  A  NaN
6  Y  F  NaN
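
The list-comprehension form from Answer 1 adapts the same way; a sketch using the same df_1 and df_2 as above:

df_1.assign(O=[df_2.get(j, {}).get(i, np.nan) for i, j in zip(df_1.M, df_1.N)])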