I have a large pandas data fame df
. It has quite a few missings. Dropping row/or col-wise is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas
and/or scikit
unfortunately doens't do the trick).
I came across what seems to be a neat package called fancyimpute
(you can find it here). But I have some problems with it.
Here is what I do:
#the neccesary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missings. I keep only floats
df_numeric = = df.select_dtypes(include=[np.float])
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled
is a single vector somehow, instead of the filled data frame. How do I get a hold of the data frame with imputations?
Update
I realized, fancyimpute
needs a numpay array
. I hence converted the df_numeric
to a an array using as_matrix()
.
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?
The
np.array
that is returned by the.complete()
method of the fancyimpute object (be it mice or KNN) is fed as the content(argument data=)
of a pandas dataframe whose cols and indexes are the same as the original data frame.Add the following lines after your code:
I see the frustration with fancy impute and pandas. Here is a fairly basic wrapper using the recursive override method. Takes in and outputs a dataframe - column names intact. These sort of wrappers work well with pipelines.