Question:
Say I have a matrix:
> import numpy as np
> a = np.random.random((5,5))
array([[ 0.28164485, 0.76200749, 0.59324211, 0.15201506, 0.74084168],
[ 0.83572213, 0.63735993, 0.28039542, 0.19191284, 0.48419414],
[ 0.99967476, 0.8029097 , 0.53140614, 0.24026153, 0.94805153],
[ 0.92478 , 0.43488547, 0.76320656, 0.39969956, 0.46490674],
[ 0.83315135, 0.94781119, 0.80455425, 0.46291229, 0.70498372]])
And that I punch some holes in it with np.NaN, e.g.:
> a[(1,4,0,3),(2,4,2,0)] = np.NaN
array([[ 0.80327707, 0.87722234, nan, 0.94463778, 0.78089194],
[ 0.90584284, 0.18348667, nan, 0.82401826, 0.42947815],
[ 0.05913957, 0.15512961, 0.08328608, 0.97636309, 0.84573433],
[ nan, 0.30120861, 0.46829231, 0.52358888, 0.89510461],
[ 0.19877877, 0.99423591, 0.17236892, 0.88059185, nan ]])
I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be using the average value of the column where the nan entries occur.
More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method.)
Background:
This problem appears often in machine learning, for example when working with missing features in classification/regression, or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia and here).
Answer 1:
If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361, 0.62987193, nan, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, nan, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ nan, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361, 0.62987193, 0.24346087, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, 0.24346087, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ 0.51259188, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, 0.30317394]])
After this, you can use imp.transform to do the same transformation to other data, using the mean that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.
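For readers who can't upgrade (or are curious what fit/transform does under the hood), the mean-imputation semantics can be sketched in plain NumPy. This is a minimal sketch, not scikit-learn's implementation; fit_column_means and transform_fill are hypothetical names:

```python
import numpy as np

def fit_column_means(X):
    # "fit" step: learn per-column means, ignoring NaN entries
    return np.nanmean(X, axis=0)

def transform_fill(X, means):
    # "transform" step: fill each NaN with its column's learned mean
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    return X

train = np.array([[1.0, np.nan],
                  [3.0, 4.0]])
means = fit_column_means(train)       # array([2.0, 4.0])

# the same learned means can now fill NaNs in *other* data
test = np.array([[np.nan, 10.0]])
filled = transform_fill(test, means)  # array([[2.0, 10.0]])
```

The key point is that the means come from the fitted data, not from the array being transformed, which is what makes the imputer usable inside a pipeline.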
If you want to wait for a stable release, then 0.14 should be out next week.
Full disclosure: I'm a scikit-learn core developer.
Answer 2:
You can do it with pure numpy, but it's nastier.
from scipy.stats import nanmean
>>> a
array([[ 0.70309466, 0.53785006, nan, 0.49590115, 0.23521493],
[ 0.29067786, 0.48236186, nan, 0.93220001, 0.76261019],
[ 0.66243065, 0.07731947, 0.38887545, 0.56450533, 0.58647126],
[ nan, 0.7870873 , 0.60010096, 0.88778259, 0.09097726],
[ 0.02750389, 0.72328898, 0.69820328, 0.02435883, nan]])
>>> mean=nanmean(a,axis=0)
>>> mean
array([ 0.42092677, 0.52158153, 0.56239323, 0.58094958, 0.41881841])
>>> index=np.where(np.isnan(a))
>>> a[index]=np.take(mean,index[1])
>>> a
array([[ 0.70309466, 0.53785006, 0.56239323, 0.49590115, 0.23521493],
[ 0.29067786, 0.48236186, 0.56239323, 0.93220001, 0.76261019],
[ 0.66243065, 0.07731947, 0.38887545, 0.56450533, 0.58647126],
[ 0.42092677, 0.7870873 , 0.60010096, 0.88778259, 0.09097726],
[ 0.02750389, 0.72328898, 0.69820328, 0.02435883, 0.41881841]])
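Note that scipy.stats.nanmean has since been deprecated and removed in newer SciPy releases; np.nanmean (NumPy ≥ 1.8) is a drop-in replacement, so the same trick works without SciPy at all. A small self-contained version:

```python
import numpy as np

a = np.array([[1.0, np.nan],
              [3.0, 8.0]])

# per-column means, ignoring NaNs (NumPy's own nanmean)
col_means = np.nanmean(a, axis=0)        # array([2.0, 8.0])

# fill each NaN with the mean of its column
rows, cols = np.where(np.isnan(a))
a[rows, cols] = np.take(col_means, cols)
# a is now [[1.0, 8.0], [3.0, 8.0]]
```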
Running some timings:
import time
import numpy as np
import pandas as pd
from scipy.stats import nanmean
a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)
%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 1.84 s per loop
%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop
#Surprisingly, the cost could be apply looping over the zero axis.
DF=pd.DataFrame(a1)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop
I do not believe numpy has array completion routines built in; however, pandas does. View the help topic here.
Answer 3:
You can do this quite simply with pandas:
import pandas as pd
DF = pd.DataFrame(a)
col_means = DF.apply(np.mean, 0)
DF.fillna(value=col_means)
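One detail worth noting: fillna returns a new DataFrame rather than modifying DF in place, and DataFrame.mean already skips NaN per column by default, so the apply step can be dropped. A self-contained sketch:

```python
import numpy as np
import pandas as pd

a = np.array([[1.0, np.nan],
              [3.0, 8.0]])

DF = pd.DataFrame(a)
# DF.mean() skips NaN per column; fillna returns a new DataFrame
filled = DF.fillna(DF.mean())
# filled is [[1.0, 8.0], [3.0, 8.0]]; DF itself still contains the NaN
```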
Answer 4:
Similar questions have been asked here before. What you need is a special case of inpainting. Unfortunately, neither numpy nor scipy has built-in routines for this. However, OpenCV has a function inpaint(), but it only works on 8-bit images.
OpenPIV has a replace_nans function that you can use for your purposes. (See here for a Cython version that you can repackage if you don't want to install the whole library.) It is more flexible than a pure mean or propagation of older values as suggested in other answers (e.g., you can define different weighting functions, kernel sizes, etc.).
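To illustrate the idea without OpenPIV, here is a toy windowed-mean fill in plain NumPy. This is not replace_nans itself (local_mean_fill is a hypothetical name): the real function additionally iterates until convergence and supports distance-based kernel weights.

```python
import numpy as np

def local_mean_fill(a, size=1):
    # fill each NaN with the mean of its valid neighbors in a
    # (2*size+1) x (2*size+1) window -- a one-pass, uniform-weight
    # caricature of the replace_nans idea
    out = a.copy()
    n_rows, n_cols = a.shape
    for r, c in zip(*np.where(np.isnan(a))):
        r0, r1 = max(r - size, 0), min(r + size + 1, n_rows)
        c0, c1 = max(c - size, 0), min(c + size + 1, n_cols)
        window = a[r0:r1, c0:c1]
        valid = window[~np.isnan(window)]
        if valid.size:                 # leave the NaN if no valid neighbor
            out[r, c] = valid.mean()
    return out

a = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
filled = local_mean_fill(a)   # center becomes the mean of its 8 neighbors, 5.0
```

Unlike the column-mean approaches above, this uses spatially local information, which is why it tends to give smoother results on image-like data.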
Using the examples from @Ophion, I compared replace_nans with the nanmean and Pandas solutions:
import numpy as np
import pandas as pd
from scipy.stats import nanmean
a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)
%timeit new_array = replace_nans(a1, 10, 0.5, 1.)
1 loops, best of 3: 1.57 s per loop
%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 2.23 s per loop
%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 7.23 s per loop
The replace_nans solution is arguably better and faster.
Answer 5:
The exact method you desire (Candes and Recht, 2008) is available for Python in the fancyimpute library, located here (link).
from fancyimpute import NuclearNormMinimization
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)
I've seen good results from it. Thankfully, over the past year they changed the autodiff and SGD backend from downhill, which uses Theano under the hood, to keras. The algorithm is available in this library too (link). SciKit-Learn's Imputer() does not include this algorithm. It's not in the documentation, but you can install fancyimpute with pip:
pip install fancyimpute