Matrix completion in Python

Published 2020-07-06 05:24

Question:

Say I have a matrix:

> import numpy as np
> a = np.random.random((5,5))

array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])

And say that I punch some holes in it with np.nan, e.g.:

> a[(1,4,0,3),(2,4,2,0)] = np.nan

array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,        nan ]])

I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be using the average value of the column in which each nan entry occurs.

More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method).

Background:

This problem appears often in machine learning, for example when working with missing features in classification/regression, or in collaborative filtering (e.g. see the Netflix Prize problem on Wikipedia).

Answer 1:

If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:

>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361,  0.62987193,         nan,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,         nan,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [        nan,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,         nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361,  0.62987193,  0.24346087,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,  0.24346087,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [ 0.51259188,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,  0.30317394]])

After this, you can use imp.transform to do the same transformation to other data, using the means that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.

If you want to wait for a stable release, then 0.14 should be out next week.

Full disclosure: I'm a scikit-learn core developer.
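For reference: in current scikit-learn (0.20 and later) this class was renamed SimpleImputer and moved to sklearn.impute, but the workflow is the same. A minimal sketch against the newer API (the random data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # successor of the old Imputer class

rng = np.random.default_rng(0)
a = rng.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

imp = SimpleImputer(strategy="mean")
filled = imp.fit_transform(a)  # learn column means from a and fill its NaNs

# Reuse the fitted imputer on new data, with the means learned from a:
b = rng.random((3, 5))
b[0, 2] = np.nan
b_filled = imp.transform(b)
```

It still plugs into a Pipeline the same way the original Imputer did.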



Answer 2:

You can do it with pure numpy, but it's messier. (Note: scipy.stats.nanmean has since been removed from SciPy; np.nanmean is the modern equivalent.)

>>> import numpy as np
>>> a
array([[ 0.70309466,  0.53785006,         nan,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,         nan,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [        nan,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,         nan]])


>>> mean = np.nanmean(a, axis=0)
>>> mean
array([ 0.42092677,  0.52158153,  0.56239323,  0.58094958,  0.41881841])
>>> index = np.where(np.isnan(a))

>>> a[index] = np.take(mean, index[1])
>>> a
array([[ 0.70309466,  0.53785006,  0.56239323,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,  0.56239323,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [ 0.42092677,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,  0.41881841]])
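For what it's worth, the take/where combination above can also be written as a single broadcasted np.where. A self-contained sketch (not from the original answer):

```python
import numpy as np

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

col_mean = np.nanmean(a, axis=0)             # per-column mean, ignoring NaN
filled = np.where(np.isnan(a), col_mean, a)  # broadcast means into the holes
```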

Running some timings:

import numpy as np
import pandas as pd
# np.nanmean replaces the removed scipy.stats.nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)


%timeit mean = np.nanmean(a, axis=0); index = np.where(np.isnan(a)); a[index] = np.take(mean, index[1])
1 loops, best of 3: 1.84 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop

# Surprisingly, much of the cost seems to be apply looping over axis 0,
# not the DataFrame construction.
DF = pd.DataFrame(a1)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop

I do not believe numpy has matrix-completion routines built in; however, pandas does (see its missing-data handling documentation).



Answer 3:

You can do this quite simply with pandas:

import numpy as np
import pandas as pd

DF = pd.DataFrame(a)
col_means = DF.apply(np.mean, 0)
DF.fillna(value=col_means)
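One caveat: fillna returns a new DataFrame rather than modifying DF in place, so you need to keep the result. A self-contained sketch (DataFrame.mean skips NaN by default, so the explicit apply over np.mean isn't strictly needed):

```python
import numpy as np
import pandas as pd

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

DF = pd.DataFrame(a)
filled = DF.fillna(DF.mean())  # fillna returns a new DataFrame; keep it
```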


Answer 4:

Similar questions have been asked here before. What you need is a special case of inpainting. Unfortunately, neither numpy nor scipy has built-in routines for this. OpenCV has an inpaint() function, but it only works on 8-bit images.

OpenPIV has a replace_nans function that you can use for your purposes. (A standalone Cython version also exists that you can repackage if you don't want to install the whole library.) It is more flexible than a pure mean or propagation of old values, as suggested in other answers: e.g., you can define different weighting functions, kernel sizes, etc.

Using the example from @Ophion, I compared replace_nans with the nanmean and pandas solutions:

import numpy as np
import pandas as pd
# np.nanmean replaces the removed scipy.stats.nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)

%timeit new_array = replace_nans(a1, 10, 0.5, 1.)
1 loops, best of 3: 1.57 s per loop

%timeit mean = np.nanmean(a, axis=0); index = np.where(np.isnan(a)); a[index] = np.take(mean, index[1])
1 loops, best of 3: 2.23 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 7.23 s per loop

The replace_nans solution is arguably better and faster.



Answer 5:

The exact method you want (Candes and Recht, 2008) is available in Python via the fancyimpute library.

from fancyimpute import NuclearNormMinimization

# X is the complete data matrix
# X_incomplete has the same values as X, except a subset have been replaced with NaN

X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)
# (newer fancyimpute releases expose this as .fit_transform() instead of .complete())

I've seen good results from it. Over the past year the maintainers switched the autodiff and SGD backend from downhill (which uses Theano under the hood) to Keras. scikit-learn's Imputer() does not include this algorithm. You can install fancyimpute with pip:

pip install fancyimpute
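If you can't install fancyimpute, the flavor of the nuclear-norm approach can be approximated in pure numpy with iterative soft-thresholded SVD (the SoftImpute idea of Mazumder et al., 2010). This is an illustrative sketch with made-up parameter values, not the fancyimpute implementation:

```python
import numpy as np

def soft_impute(X, shrink=0.5, n_iter=100):
    """Fill NaNs by iteratively fitting a soft-thresholded low-rank SVD.

    shrink (singular-value shrinkage) and n_iter are illustrative
    choices, not tuned values.
    """
    mask = np.isnan(X)
    filled = np.where(mask, 0.0, X)              # start with zeros in the holes
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrink, 0.0)          # soft-threshold singular values
        low_rank = (U * s) @ Vt
        filled = np.where(mask, low_rank, X)     # keep observed entries fixed
    return filled

rng = np.random.default_rng(0)
X = rng.random((5, 5))
X[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan
X_filled = soft_impute(X)
```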