Say I have a matrix:
> import numpy as np
> a = np.random.random((5,5))
array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])
And say that I punch some holes in it with np.nan, e.g.:
> a[(1,4,0,3),(2,4,2,0)] = np.nan
array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,         nan]])
I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be using the average value of the column where the nan entries occur.
More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method).
Background:
This problem appears often in machine learning, for example when working with missing features in classification/regression, or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia).
The exact method you desire (Candes & Recht, 2008) is available for Python in the fancyimpute library, located here (link). I've seen good results from it. Thankfully, over the past year they changed the autodiff and SGD backend from downhill, which uses Theano under the hood, to keras. The algorithm is available in this library too (link). Scikit-learn's Imputer() does not include this algorithm. It's not in the documentation, but you can install fancyimpute with pip:
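The install is a one-liner (package name current as of this writing):

```shell
pip install fancyimpute
```

In recent releases the Candes & Recht method is exposed as the NuclearNormMinimization class, used roughly as NuclearNormMinimization().fit_transform(X_incomplete); older releases named the method complete() instead, so check the version you get.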
You can do this quite simply with pandas.
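A minimal sketch of the pandas approach: DataFrame.fillna accepts a Series of per-column values, and df.mean() skips NaNs by default, so each hole gets its column mean.

```python
import numpy as np
import pandas as pd

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

df = pd.DataFrame(a)
# fillna aligns the Series returned by df.mean() with the columns,
# so every NaN is replaced by the mean of its own column.
filled = df.fillna(df.mean()).to_numpy()
```

Entries that were not NaN are left untouched.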
You can do it with pure numpy, but it's nastier.
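The pure-numpy equivalent, using np.nanmean and fancy indexing:

```python
import numpy as np

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

col_means = np.nanmean(a, axis=0)   # per-column means, ignoring NaNs
rows, cols = np.where(np.isnan(a))  # coordinates of the holes
a[rows, cols] = col_means[cols]     # fill each hole with its column's mean
```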
Running some timings:
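A small harness to run the comparison yourself, using timeit (the array size, missing fraction, and repeat count here are arbitrary choices; absolute numbers depend on your machine):

```python
import timeit

setup = """
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
a = rng.random((500, 500))
a[rng.random(a.shape) < 0.1] = np.nan   # ~10% missing entries
"""

numpy_stmt = """
b = a.copy()
col_means = np.nanmean(b, axis=0)
rows, cols = np.where(np.isnan(b))
b[rows, cols] = col_means[cols]
"""

pandas_stmt = """
df = pd.DataFrame(a)
b = df.fillna(df.mean()).to_numpy()
"""

t_numpy = timeit.timeit(numpy_stmt, setup=setup, number=20)
t_pandas = timeit.timeit(pandas_stmt, setup=setup, number=20)
print(f"pure numpy: {t_numpy:.3f}s   pandas: {t_pandas:.3f}s")
```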
I do not believe numpy has array completion routines built in; however, pandas does. See the pandas documentation on working with missing data.
If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class.

After this, you can use imp.transform to do the same transformation to other data, using the mean that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines. If you want to wait for a stable release, 0.14 should be out next week.

Full disclosure: I'm a scikit-learn core developer.
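Note for current readers: the Imputer class described in this answer was later removed from scikit-learn; modern versions expose the same mean-imputation behaviour as SimpleImputer in sklearn.impute (always column-wise, no axis argument). A sketch with the current API:

```python
import numpy as np
from sklearn.impute import SimpleImputer

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

imp = SimpleImputer(strategy="mean")   # learns per-column means
a_filled = imp.fit_transform(a)

# imp.transform(other) now applies the means learned from `a`
# to new data, and imp can be placed inside a Pipeline.
```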
Similar questions have been asked here before. What you need is a special case of inpainting. Unfortunately, neither numpy nor scipy has built-in routines for this. However, OpenCV has a function inpaint(), but it only works on 8-bit images.

OpenPIV has a replace_nans function that you can use for your purposes. (See here for a Cython version that you can repackage if you don't want to install the whole library.) It is more flexible than a pure mean or propagation of older values as suggested in other answers (e.g., you can define different weighting functions, kernel sizes, etc.).

Using the examples from @Ophion, I compared replace_nans with the nanmean and pandas solutions. The replace_nans solution is arguably better and faster.