I have a pandas DataFrame in which some columns are of text type, and these text columns contain some NaN values. I am trying to impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation.
Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature.
Once I run:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python raises an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.
Any help would be very welcome.
With the scikit-learn Imputer, strategy = 'most_frequent' can be used only with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, the scikit-learn imputer can be applied either to the whole DataFrame (if all features are quantitative) or in a for loop over lists of features/columns of similar type (see the example below), whereas the custom imputer can be used with any combination of columns.
Custom imputer:
Dataframe:
1) Can be used with a list of features of similar type.
2) Can be used with strategy = 'median'.
3) Can be used with the whole DataFrame. By default it uses the mean (which can be changed to the median) for quantitative features and strategy = 'most_frequent' for qualitative features.
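A minimal sketch of what such a custom imputer could look like, covering the three usages listed above. The class name MixedImputer, its parameters, and the sample DataFrame are assumptions for illustration, not the original code; it mimics the fit/transform interface without depending on scikit-learn.

```python
import numpy as np
import pandas as pd


class MixedImputer:
    """Hypothetical imputer: most frequent value for non-numeric columns,
    mean or median (per `strategy`) for numeric columns."""

    def __init__(self, strategy="mean", columns=None):
        if strategy not in ("mean", "median", "most_frequent"):
            raise ValueError(f"unknown strategy: {strategy!r}")
        self.strategy = strategy
        self.columns = columns  # None means "use every column"

    def _fill_value(self, col):
        numeric = pd.api.types.is_numeric_dtype(col)
        if self.strategy == "most_frequent" or not numeric:
            modes = col.mode()  # mode() skips NaN by default
            return modes.iloc[0] if not modes.empty else np.nan
        return col.mean() if self.strategy == "mean" else col.median()

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        fills = {c: self._fill_value(X[c]) for c in cols}
        # Skip columns with no usable fill value (e.g. entirely NaN).
        self.fill_ = {c: v for c, v in fills.items() if pd.notna(v)}
        return self

    def transform(self, X):
        # A dict passed to fillna is applied column by column.
        return X.fillna(value=self.fill_)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


df = pd.DataFrame({
    "run": ["run1", "run2", np.nan, "run1"],
    "score": [1.0, np.nan, 3.0, 8.0],
    "count": [10.0, 20.0, np.nan, 60.0],
})

# 1) A list of features of similar type (other columns are left untouched).
mean_filled = MixedImputer(strategy="mean", columns=["score", "count"]).fit_transform(df)

# 2) The same columns with strategy='median'.
median_filled = MixedImputer(strategy="median", columns=["score", "count"]).fit_transform(df)

# 3) The whole DataFrame: mean for numeric columns, most frequent for text.
all_filled = MixedImputer().fit_transform(df)
```

Storing the fill values in fit and applying them in transform means the values learned on training data can be reused on new data.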
Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object.
To use it you would do:
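A minimal sketch of a Series imputer along these lines, followed by its usage. The class name SeriesImputer and the sample data are assumptions for illustration; the rule (most frequent value for non-numeric data, mean otherwise) is one plausible reading of the approach, not the original code.

```python
import numpy as np
import pandas as pd


class SeriesImputer:
    """Hypothetical imputer for a single pandas.Series."""

    def fit(self, X, y=None):
        if pd.api.types.is_numeric_dtype(X):
            # Numeric data: fill with the mean of the non-missing values.
            self.fill_ = X.mean()
        else:
            # value_counts() drops NaN and sorts by frequency, descending.
            self.fill_ = X.value_counts().index[0]
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill_)


s = pd.Series(["a", "b", np.nan, "a"])
filled = SeriesImputer().fit(s).transform(s)

n = pd.Series([1.0, np.nan, 3.0])
n_filled = SeriesImputer().fit(n).transform(n)
```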
Similar. Modify Imputer for strategy='most_frequent', where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills missing values with these. Other strategy values are still handled the same way by Imputer.
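A rough sketch of the mode()/fillna() step on its own, written as a plain function rather than an Imputer subclass, since sklearn.preprocessing.Imputer is not available in newer scikit-learn releases. The function name and sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_most_frequent(df):
    """Fill NaN in every column with that column's most frequent value."""
    # mode() returns one row per tied mode; row 0 holds the first mode
    # of each column, indexed by column name.
    most_frequent = df.mode().iloc[0]
    # A Series passed to DataFrame.fillna is matched against column names.
    return df.fillna(most_frequent)


df = pd.DataFrame({
    "run": ["run1", np.nan, "run1", "run2"],
    "level": [1.0, 2.0, 2.0, np.nan],
})
filled = fill_most_frequent(df)
```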
There is a package, sklearn-pandas, which has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer
Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four strategies for imputation:
mean, mode, median, and fill. It works on both pd.DataFrame and pd.Series. mean and median work only for numeric data; mode and fill work for both numeric and categorical data.
Usage:
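A minimal sketch of how an imputer with these four strategies might be written and used. The function name impute, its signature, and the sample data are assumptions for illustration, not the original code.

```python
import numpy as np
import pandas as pd


def impute(data, strategy="mode", fill_value=None):
    """Hypothetical helper: impute NaN in a pd.Series or pd.DataFrame."""
    if strategy == "fill":
        if fill_value is None:
            raise ValueError("strategy='fill' needs a fill_value")
        return data.fillna(fill_value)
    if strategy == "mode":
        modes = data.mode()  # works for numeric and categorical data
        if modes.empty:
            return data.copy()  # nothing to fill with
        # Scalar for a Series, per-column row for a DataFrame.
        return data.fillna(modes.iloc[0])
    if strategy in ("mean", "median"):
        if isinstance(data, pd.Series):
            stat = getattr(data, strategy)()
        else:
            # Only numeric columns get a statistic; others stay untouched.
            stat = getattr(data, strategy)(numeric_only=True)
        return data.fillna(stat)
    raise ValueError(f"unknown strategy: {strategy!r}")


s = pd.Series(["a", np.nan, "b", "a"])
s_mode = impute(s, "mode")
s_fill = impute(s, "fill", fill_value="missing")

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "c": ["u", np.nan, "u"]})
df_mean = impute(df, "mean")  # fills x only
df_mode = impute(df, "mode")  # fills both columns
```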
This code fills in a series with the most frequent category:
Outputs:
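A small example of this kind, with its output shown as a comment; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["cat", "dog", np.nan, "cat", np.nan])

# mode() ignores NaN; take the first mode as the most frequent category.
filled = s.fillna(s.mode().iloc[0])

print(filled.tolist())
# ['cat', 'dog', 'cat', 'cat', 'cat']
```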