Pandas fillna with a lookup table

Posted 2020-07-13 09:45

I'm having some trouble filling NaNs. I want to take a DataFrame column with a few NaNs and fill them with a value derived from a 'lookup table', based on a value from another column. (You might recognize my data from the Titanic data set)...

    Pclass   Age
0   1        33
1   3        24
2   1        23
3   2        NaN
4   1        Nan

I want to fill the NaNs with a value from the Series 'pclass_lookup':

pclass_lookup
1        38.1
2        29.4
3        25.2

I have tried fillna with indexing, like:

    df.Age.fillna(pclass_lookup[df.Pclass])

but it gives me an error of

    ValueError: cannot reindex from a duplicate axis

Lambdas were a try too:

    df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])

but that doesn't seem to fill it right, either. Am I totally missing the boat here?
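
For what it's worth, both attempts fail for identifiable reasons. pclass_lookup[df.Pclass] returns a Series indexed by the Pclass labels themselves, and since class 1 appears more than once that index has duplicates; fillna then tries to reindex that Series to the Age column's index, which is exactly the "cannot reindex from a duplicate axis" error. The lambda never fires because NaN (and the literal string 'Nan') are truthy, and even if it did, pclass_lookup[df.Pclass] inside it would return the whole lookup Series rather than the single value for that row. A minimal sketch of both points, assuming real NaNs for the missing ages (the literal 'Nan' string in row 4 is addressed in an answer below):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, np.nan, np.nan]})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

# Indexing the lookup with the whole column yields duplicate index labels
pclass_lookup[df.Pclass]
# 1    38.1
# 3    25.2
# 1    38.1
# 2    29.4
# 1    38.1
# dtype: float64

# NaN is truthy, so `x if x else ...` never substitutes anything
bool(float('nan'))   # True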

Tags: python pandas
3 Answers
等我变得足够好
#2 · 2020-07-13 10:25

The following should work for you:

import pandas as pd

# Build the example frame from the question
df = pd.DataFrame()
df['Pclass'] = [1, 3, 1, 2, 1]
df['Age'] = [33, 24, 23, None, None]
df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2   NaN
4       1   NaN

pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1    38.1
2    29.4
3    25.2
dtype: float64

def remove_na(x):
    # Fill a missing Age from the lookup, keyed on the row's Pclass
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']

df['Age'] = df.apply(remove_na, axis=1)

   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1
在下西门庆
#3 · 2020-07-13 10:33

Firstly, you have a duff value in row 4: it is actually the string 'Nan', which is not the same as NaN, so even if your code did work, that value would never be replaced.

So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:

In [317]:

import numpy as np  # for np.nan

df.Age.replace('Nan', np.nan, inplace=True)   # turn the string 'Nan' into a real NaN
df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
df
Out[317]:
   Pclass   Age
0       1    33
1       3    24
2       1    23
3       2  29.4
4       1  38.1
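
Since map returns a Series aligned to df's index, the same fill can also be written with fillna directly; this is not from the original answer, just an equivalent sketch using the pclass_lookup Series from the question:

df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))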

Timings

For a df with 5000 rows:

In [26]:

%timeit df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:

%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:

%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop

So you can see that apply, since it iterates row-wise, scales poorly compared to the other two methods, which are vectorised; map is still the fastest.
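
The exact 5000-row frame behind these timings isn't shown; a rough sketch of a comparable setup (the column names match the question, but the sizes, value ranges, and missing-value fraction here are assumptions) might look like:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({'Pclass': rng.integers(1, 4, size=n),   # classes 1-3
                   'Age': rng.uniform(1, 80, size=n)})
df.loc[df.sample(frac=0.2, random_state=0).index, 'Age'] = np.nan   # ~20% missing
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

The three approaches above can then be timed against this frame with %timeit.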

乱世女痞
#4 · 2020-07-13 10:37

Building on the response of @vrajs5:

# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))

# Solution: look up each missing row's Pclass, then assign by position
nulls = df.loc[df.Age.isnull(), 'Pclass']
# .values drops the lookup's index so the assignment doesn't try to align on it
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values

>>> df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1