Matrix completion in Python

Asked 2020-07-06 05:03

Say I have a matrix:

> import numpy as np
> a = np.random.random((5,5))

array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])

And say that I punch some holes in it with np.NaN, e.g.:

> a[(1,4,0,3),(2,4,2,0)] = np.NaN; 

array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,        nan ]])

I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be to use the average value of the column in which each nan entry occurs.

More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method).

Background:

This problem appears often in machine learning, for example when working with missing features in classification/regression, or in collaborative filtering (e.g. see the Netflix Prize on Wikipedia and here).

5 Answers

[account banned] · answered 2020-07-06 05:13

The exact method you desire (Candes and Recht, 2008) is available for Python in the fancyimpute library, located here (link).

from fancyimpute import NuclearNormMinimization

# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN

X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)

I've seen good results from it. Thankfully, over the past year they changed the autodiff and SGD backend from downhill, which uses Theano under the hood, to Keras. The algorithm is available in this library too (link). scikit-learn's Imputer() does not include this algorithm. It's not in the documentation, but you can install fancyimpute with pip:

pip install fancyimpute
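For completeness, a slightly fuller end-to-end sketch of how this might look. Note that more recent fancyimpute releases follow the scikit-learn convention and expose fit_transform() rather than complete(), so which call you need depends on the installed version:

import numpy as np
from fancyimpute import NuclearNormMinimization

# Build a small matrix and punch holes in it, as in the question
X_incomplete = np.random.random((5, 5))
X_incomplete[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

solver = NuclearNormMinimization()
# Older fancyimpute: X_filled = solver.complete(X_incomplete)
# Newer fancyimpute: X_filled = solver.fit_transform(X_incomplete)
X_filled = solver.fit_transform(X_incomplete)
print(X_filled)  # nuclear-norm-minimizing completion of the missing entries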
贪生不怕死 · answered 2020-07-06 05:17

You can do this quite simply with pandas:

import numpy as np
import pandas as pd

DF = pd.DataFrame(a)              # a is the array with NaN holes from the question
col_means = DF.apply(np.mean, 0)  # column means; NaNs are skipped
DF = DF.fillna(value=col_means)   # fillna returns a new DataFrame, so reassign it
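Equivalently, you can let pandas compute the column means itself; a small sketch of the same idea:

DF = pd.DataFrame(a)
DF = DF.fillna(DF.mean())  # DataFrame.mean() skips NaNs by default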
等我变得足够好 · answered 2020-07-06 05:19

You can do it with plain NumPy (plus scipy.stats.nanmean), but it's messier.

>>> from scipy.stats import nanmean
>>> a
array([[ 0.70309466,  0.53785006,         nan,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,         nan,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [        nan,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,         nan]])


>>> mean=nanmean(a,axis=0)
>>> mean
array([ 0.42092677,  0.52158153,  0.56239323,  0.58094958,  0.41881841])
>>> index=np.where(np.isnan(a))

>>> a[index]=np.take(mean,index[1])
>>> a
array([[ 0.70309466,  0.53785006,  0.56239323,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,  0.56239323,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [ 0.42092677,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,  0.41881841]])
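On newer installations scipy.stats.nanmean has been removed; np.nanmean does the same job. A minimal sketch of the same column-mean fill wrapped in a function (the helper name is just for illustration):

import numpy as np

def fill_nan_with_col_means(a):
    # Return a copy of `a` with every NaN replaced by its column's mean.
    filled = a.copy()
    col_means = np.nanmean(filled, axis=0)   # per-column means, ignoring NaNs
    rows, cols = np.where(np.isnan(filled))  # locations of the missing entries
    filled[rows, cols] = col_means[cols]     # put the matching column mean into each hole
    return filled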

Running some timings:

import time
import numpy as np
import pandas as pd
from scipy.stats import nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)


%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 1.84 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop

# Surprisingly, constructing the DataFrame is not the bottleneck;
# most of the time goes to apply looping over axis 0.
DF=pd.DataFrame(a1)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop

I do not believe numpy has array completion routines built in; however, pandas does. View the help topic here.
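Column means are only one option; pandas ships several other built-in fill strategies that may fit, depending on the data. A quick sketch:

import numpy as np
import pandas as pd

DF = pd.DataFrame(np.array([[1.0, np.nan, 3.0],
                            [4.0, 5.0, np.nan],
                            [np.nan, 8.0, 9.0]]))

DF.ffill()        # propagate the last valid value down each column
DF.interpolate()  # linear interpolation along each column
DF.fillna(0.0)    # fill every hole with a constant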

[account banned] · answered 2020-07-06 05:22

If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:

>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361,  0.62987193,         nan,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,         nan,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [        nan,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,         nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361,  0.62987193,  0.24346087,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,  0.24346087,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [ 0.51259188,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,  0.30317394]])

After this, you can use imp.transform to apply the same transformation to other data, using the means that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.
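A minimal sketch of that Pipeline usage; the import below assumes a current scikit-learn, where the class has moved to sklearn.impute.SimpleImputer (the old sklearn.preprocessing.Imputer has since been removed):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X_train = np.random.random((20, 5))
X_train[(1, 4, 7), (2, 0, 3)] = np.nan
y_train = np.arange(20) % 2

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # learns the column means during fit
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

X_new = np.random.random((3, 5))
X_new[0, 1] = np.nan
print(pipe.predict(X_new))  # NaNs are filled with the training-set column means first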

If you want to wait for a stable release, then 0.14 should be out next week.

Full disclosure: I'm a scikit-learn core developer.

贼婆χ · answered 2020-07-06 05:26

Similar questions have been asked here before. What you need is a special case of inpainting. Unfortunately, neither numpy nor scipy has built-in routines for this. OpenCV does have an inpaint() function, but it only works on 8-bit images.

OpenPIV has a replace_nans function that you can use for your purposes. (See here for a Cython version that you can repackage if you don't want to install the whole library.) It is more flexible than a plain mean or propagation of earlier values, as suggested in the other answers (e.g., you can define different weighting functions, kernel sizes, etc.).
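As a rough sketch of what a call might look like (the import path is an assumption that depends on the OpenPIV version; the positional arguments simply mirror the call used in the timings below):

import numpy as np
# Import path is an assumption -- replace_nans may live elsewhere in your OpenPIV
# version, or come from the standalone Cython module mentioned above.
from openpiv.lib import replace_nans

a = np.random.random((5, 5))
a[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan

# Assumed argument order, taken from the timing call below: (array, max_iter, tol, kernel_size)
filled = replace_nans(a, 10, 0.5, 1)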

Using the example from @Ophion, I compared replace_nans with the nanmean and pandas solutions:

import numpy as np
import pandas as pd
from scipy.stats import nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)

%timeit new_array = replace_nans(a1, 10, 0.5, 1.)
1 loops, best of 3: 1.57 s per loop

%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 2.23 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 7.23 s per loop

The replace_nans solution is arguably better and faster.
