Can I replace Nans with the mode of a column in a

2019-07-09 06:44发布

问题:

I have some data that looks like...

Year      Make   Model  Trim
2007     Acura  TL      Base
2010     Dodge  Avenger SXT
2009     Dodge  Caliber SXT
2008     Dodge  Caliber SXT
2008     Dodge  Avenger SXT

Trim has some missing values. What I would like to do is something like the following:

  • Group by year make and model
  • Impute Trim if there are missing valyes for that group

So for instance, I would look at all the 2007 Acura TL. That might look like

 Year      Make   Model Trim
2007     Acura  TL      Base
2007     Acura  TL      XLR
2007     Acura  TL      NaN
2007     Acura  TL      Base

Then, I would impute the Nan with Base (since Base is the Mode). It is important to remember here that I want to do this for every group of Year, Make, and Model.

回答1:

Use groupby then mode. Note that mode returns an array and you want to grab the first element of it. @John Galt deserves credit for this and gets my upvote.

I use assign to create a copy of df with an overwritten version of the Trim column.

df.assign(
    Trim=df.groupby(
        ['Year', 'Make', 'Model']
    ).Trim.apply(
        lambda x: x.fillna(x.mode()[0])
    )
)

   Year   Make Model  Trim
0  2007  Acura    TL  Base
1  2007  Acura    TL   XLR
2  2007  Acura    TL  Base
3  2007  Acura    TL  Base

You can overwrite the column directly with

df['Trim'] = df.groupby(
    ['Year', 'Make', 'Model']
).Trim.apply(
    lambda x: x.fillna(x.mode()[0])
)


回答2:

Use mode

In [215]: df
Out[215]:
   Year   Make    Model  Trim
0  2007  Acura       TL  Base
1  2010  Dodge  Avenger   SXT
2  2009  Dodge  Caliber   NaN
3  2008  Dodge  Caliber   SXT
4  2008  Dodge  Avenger   SXT

In [216]: df.Trim.fillna(df.Trim.mode()[0])
Out[216]:
0    Base
1     SXT
2     SXT
3     SXT
4     SXT
Name: Trim, dtype: object

Use inplace=True to actually set

In [217]: df.Trim.fillna(df.Trim.mode()[0], inplace=True)

In [218]: df
Out[218]:
   Year   Make    Model  Trim
0  2007  Acura       TL  Base
1  2010  Dodge  Avenger   SXT
2  2009  Dodge  Caliber   SXT
3  2008  Dodge  Caliber   SXT
4  2008  Dodge  Avenger   SXT

If you're working on groups

In [227]: df
Out[227]:
   Year   Make Model  Trim
0  2007  Acura    TL  Base
1  2007  Acura    TL   XLR
2  2007  Acura    TL   NaN
3  2007  Acura    TL  Base

In [228]: (df.groupby(['Year', 'Make', 'Model'])['Trim']
             .apply(lambda x: x.fillna(x.mode()[0])))
     ...:
Out[228]:
0    Base
1     XLR
2    Base
3    Base
Name: Trim, dtype: object