I have a dataset in which there is a column called Native Country that contains around 30,000 records. Some values are missing, represented by NaN, so I thought to fill them with the mode() value. I wrote something like this:
data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)
However, when I count the missing values:
for col_name in data.columns:
    print("column:", col_name, ". Missing:", sum(data[col_name].isnull()))
It is still coming up with the same number of NaN
values for the column Native Country.
If we fill in the missing values with
fillna(df['colX'].mode())
, since the result of mode() is a Series, it will only fill in the first couple of rows, for the matching indices (see the sketch after this answer). However, by simply taking the first value of the Series,
fillna(df['colX'].mode()[0])
, I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes the already biased imputation method worse. For example, taking only 0 if we have [0, 21, 99] as the equally most frequent values, or filling missing values with False when True and False values are equally frequent in a given column. I don't have a clear-cut solution here. Assigning a random value from all the local maxima could be one approach if using the mode is a necessity.
Be careful: NaN may be the mode of your dataframe, in which case you are replacing NaN with another NaN.
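A small sketch of that trap, assuming a pandas version (0.24+) where Series.mode accepts a dropna flag: with the default dropna=True, NaN never shows up in the result, but with dropna=False a mostly missing column returns NaN as its mode:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0])

print(s.mode())              # 1.0 -- NaN is ignored by default
print(s.mode(dropna=False))  # NaN -- NaN is the most frequent value

# Filling with this "mode" silently replaces NaN with NaN:
print(s.fillna(s.mode(dropna=False)[0]).isnull().sum())  # still 2 missing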
Just call the first element of the series:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
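Either way, re-running the count from the question should now report zero missing values for the column:

print(data['Native Country'].isnull().sum())  # expect 0 after the fill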