Pandas fillna with a lookup table

Posted 2020-07-13 09:45

I'm having some trouble filling NaNs. I want to take a DataFrame column with a few NaNs and fill them with a value derived from a 'lookup table', based on a value from another column. (You might recognize my data from the Titanic data set)...

    Pclass   Age
0   1        33
1   3        24
2   1        23
3   2        NaN
4   1        Nan

I want to fill the NaNs with a value from the Series 'pclass_lookup':

pclass_lookup
1        38.1
2        29.4
3        25.2

I have tried fillna with indexing, like:

    df.Age.fillna(pclass_lookup[df.Pclass])

but it gives me an error of

    ValueError: cannot reindex from a duplicate axis

Lambdas were a try too:

    df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])

but that doesn't seem to fill it right, either. Am I totally missing the boat here?
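
For what it's worth, both attempts fail for identifiable reasons. pclass_lookup[df.Pclass] returns a Series indexed by the Pclass labels themselves, and since class 1 appears more than once that index has duplicates; fillna then tries to reindex that Series to the Age column's index, which is exactly the "cannot reindex from a duplicate axis" error. The lambda never fires because NaN (and the literal string 'Nan') are truthy, and even if it did, pclass_lookup[df.Pclass] inside it would return the whole lookup Series rather than the single value for that row. A minimal sketch of both points, assuming real NaNs for the missing ages (the literal 'Nan' string in row 4 is addressed in an answer below):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 1, 2, 1],
                   'Age': [33, 24, 23, np.nan, np.nan]})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

# Indexing the lookup with the whole column yields duplicate index labels
pclass_lookup[df.Pclass]
# 1    38.1
# 3    25.2
# 1    38.1
# 2    29.4
# 1    38.1
# dtype: float64

# NaN is truthy, so `x if x else ...` never substitutes anything
bool(float('nan'))   # True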

Tags: python pandas
3 Answers
等我变得足够好
#2 · 2020-07-13 10:25

The following should work for you:

import pandas as pd

# Build the example frame from the question
df = pd.DataFrame()
df['Pclass'] = [1, 3, 1, 2, 1]
df['Age'] = [33, 24, 23, None, None]
df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2   NaN
4       1   NaN

pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1    38.1
2    29.4
3    25.2
dtype: float64

def remove_na(x):
    # Fill a missing Age from the lookup, keyed on the row's Pclass
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']

df['Age'] = df.apply(remove_na, axis=1)

   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1
在下西门庆
#3 · 2020-07-13 10:33

Firstly, you have a duff value in row 4: it is actually the string 'Nan', which is not the same as NaN, so even if your code did work, that value would never be replaced.

So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:

In [317]:

import numpy as np  # for np.nan

df.Age.replace('Nan', np.nan, inplace=True)   # turn the string 'Nan' into a real NaN
df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
df
Out[317]:
   Pclass   Age
0       1    33
1       3    24
2       1    23
3       2  29.4
4       1  38.1
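
Since map returns a Series aligned to df's index, the same fill can also be written with fillna directly; this is not from the original answer, just an equivalent sketch using the pclass_lookup Series from the question:

df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))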

Timings

For a df with 5000 rows:

In [26]:

%timeit df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:

%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:

%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop

So you can see that apply, since it iterates row-wise, scales poorly compared to the other two methods, which are vectorised; map is still the fastest.
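
The exact 5000-row frame behind these timings isn't shown; a rough sketch of a comparable setup (the column names match the question, but the sizes, value ranges, and missing-value fraction here are assumptions) might look like:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({'Pclass': rng.integers(1, 4, size=n),   # classes 1-3
                   'Age': rng.uniform(1, 80, size=n)})
df.loc[df.sample(frac=0.2, random_state=0).index, 'Age'] = np.nan   # ~20% missing
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])

The three approaches above can then be timed against this frame with %timeit.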

乱世女痞
#4 · 2020-07-13 10:37

Building on the response of @vrajs5:

# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))

# Solution: look up each missing row's Pclass, then assign by position
nulls = df.loc[df.Age.isnull(), 'Pclass']
# .values drops the lookup's index so the assignment doesn't try to align on it
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values

>>> df
   Pclass   Age
0       1  33.0
1       3  24.0
2       1  23.0
3       2  29.4
4       1  38.1