I have a pandas DataFrame in which some columns are of text type, and these text columns contain some NaN values. I'm trying to impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation.
Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature.
Once I run:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python generates an error: could not convert string to float: 'run1', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.
Any help would be very welcome
You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame). You can then combine these subpipelines with sklearn.pipeline.FeatureUnion.

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
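A minimal sketch of the two-subpipeline layout described above. It is not the book's or the answerer's exact code: the ColumnSelector class, the column names and the sample data are made up for illustration. It also assumes a recent scikit-learn, where sklearn.preprocessing.Imputer has been removed in favour of sklearn.impute.SimpleImputer. SimpleImputer stands in for both Imputer and CategoricalImputer here, since its most_frequent strategy also accepts string columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: select the named DataFrame columns as an array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].to_numpy()


# Toy data (assumed): one numeric and one text column, each with a NaN.
df = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 5.0],
    "run": ["run1", "run2", np.nan, "run1"],
})

# Numeric subpipeline: select columns, then fill NaN with the column mean.
num_pipeline = Pipeline([
    ("selector", ColumnSelector(["height"])),
    ("imputer", SimpleImputer(strategy="mean")),
])

# Categorical subpipeline: select columns, then fill NaN with the mode.
cat_pipeline = Pipeline([
    ("selector", ColumnSelector(["run"])),
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

# FeatureUnion runs both subpipelines and stacks their outputs side by side.
full_pipeline = FeatureUnion([
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

result = full_pipeline.fit_transform(df)
print(result)
```

The result is a NumPy array of object dtype, with the numeric columns first and the categorical columns after them, so the original column names have to be reattached if a DataFrame is needed downstream.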
To use mean values for numeric columns and the most frequent value for non-numeric columns you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.
which prints,
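A minimal sketch of that idea using only pandas, under the assumption that "non-numeric" means any column pandas does not report as a numeric dtype. The function name impute_mixed and the sample data are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def impute_mixed(frame):
    """Fill NaN with the mean (numeric columns) or the mode (other columns)."""
    fill_values = {}
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_numeric_dtype(column):
            fill_values[name] = column.mean()
        else:
            # value_counts() ignores NaN and sorts by frequency, descending,
            # so the first index entry is the most frequent value.
            fill_values[name] = column.value_counts().index[0]
    # fillna with a dict fills each column with its own value.
    return frame.fillna(value=fill_values)


# Toy data (assumed): one text column and one float column, each with a NaN.
df = pd.DataFrame({
    "label": ["a", "b", "b", np.nan],
    "score": [1.0, 2.0, np.nan, 3.0],
})

imputed = impute_mixed(df)
print(imputed)
```

Here the missing label becomes "b" and the missing score becomes 2.0. A column that is entirely NaN would need separate handling, since value_counts() would then be empty. To follow the suggestion about integer columns, column.mean() could be swapped for column.median().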