Question:
I get a memory error when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.
For the related function silhouette_score, this post suggests using the sample_size parameter, which down-samples the data before the silhouette computation. I am not sure that down-sampling would still produce reliable results, so I hesitate to do that.
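For reference, the down-sampled call would look something like the sketch below (the arrays here are random stand-ins for my data; sample_size and random_state are parameters of silhouette_score):
import numpy as np
from sklearn.metrics import silhouette_score

# Stand-ins for my data: ~100k rows in 12 dimensions plus precomputed cluster labels.
X = np.random.rand(107545, 12)
labels = np.random.randint(0, 5, size=len(X))

# silhouette_score draws sample_size points at random and scores only that subset,
# so the distance matrix stays at sample_size x sample_size (10000 x 10000 here,
# roughly 0.8 GB of float64 instead of the full matrix).
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=10000, random_state=42)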
My input, X, is a [107545 rows x 12 columns] DataFrame, which I would not really consider big, although I do only have 8 GB of RAM.
sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
167 check_number_of_labels(len(le.classes_), X.shape[0])
168
--> 169 distances = pairwise_distances(X, metric=metric, **kwds)
170 unique_labels = le.classes_
171 n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
The calculation seems to rely on euclidean_distances, which crashed on the call to np.dot. I am not dealing with sparsity here, so maybe there is no solution. When calculating distances I normally use numpy.linalg.norm(A-B). Does this have better memory handling?
Answer 1:
Update: PR 11135 should resolve this issue within scikit-learn, making the rest of the post obsolete.
You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each is a floating point number; the float64 format takes 8 bytes of memory. So the size of the distance matrix is 8e10 bytes, which is about 74.5 GB.
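The same estimate in code, using both the exact row count from the question and the rounded 1e5 figure:
n_exact = 107545                      # rows in the question's DataFrame
print(n_exact ** 2 * 8 / 2 ** 30)     # ~86.2 GiB for the exact count
print(int(1e5) ** 2 * 8 / 2 ** 30)    # ~74.5 GiB for the rounded 1e5 figure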
This has been reported on GitHub before (#4701, #4197), with the answer being roughly that it is a NumPy limitation: it cannot handle np.dot with matrices of that size. Although there was one comment saying that
it might be possible to break this up into sub-matrices to do the calculation more memory efficient.
Indeed, if instead of forming one giant distance matrix at the beginning, the method computed relevant chunks of it in the loop over labels, that would require less memory.
It is not hard to modify the method, starting from its source, so that instead of computing the distances first and applying binary masks later, it masks first. This is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size.
If this is something that looks practical, I imagine it could be added to scikit-learn by way of some flag. Note, however, that this version does not support metric='precomputed'.
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels


def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find intra_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
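A quick way to sanity-check this against the built-in implementation, on data small enough for both to fit in memory (the synthetic KMeans setup below is only for illustration; on the question's scikit-learn version the two results should agree up to floating-point noise):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.RandomState(0)
X_small = rng.rand(500, 12)
labels_small = KMeans(n_clusters=4, random_state=0).fit_predict(X_small)

# Both compute the per-sample silhouette coefficient; only the memory profile differs.
expected = silhouette_samples(X_small, labels_small)
actual = silhouette_samples_memory_saving(X_small, labels_small)
print(np.allclose(expected, actual))  # expected: True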
Answer 2:
I developed a memory-efficient and relatively fast solution for the euclidean distance case using numba. It works in roughly constant memory relative to the size of the input data and uses numba's automatic parallelization. With it I was able to handle a dataset of 300000 rows in 24 dimensions (whose full pairwise-distance matrix would have needed about 720 GB of RAM). It can be modified to implement other distance metrics as needed.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels
from numba import jit


@jit(nogil=True, parallel=True)
def euclidean_distances_numba(X, Y=None, Y_norm_squared=None):
    # disable checks
    XX_ = (X * X).sum(axis=1)
    XX = XX_.reshape((1, -1))

    if X is Y:  # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = Y_norm_squared
    else:
        YY_ = np.sum(Y * Y, axis=1)
        YY = YY_.reshape((1, -1))

    distances = np.dot(X, Y.T)
    distances *= -2
    distances += XX
    distances += YY
    distances = np.maximum(distances, 0)
    return np.sqrt(distances)


@jit(parallel=True)
def euclidean_distances_sum(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    sums = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        sums[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).sum()
    return sums


@jit(parallel=True)
def euclidean_distances_mean(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    means = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        means[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).mean()
    return means


def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find intra_clust_dist for all samples belonging to the same label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_clust_dists[mask] = euclidean_distances_sum(X[mask, :]) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                other_distances = euclidean_distances_mean(X[mask, :], X[other_mask, :])
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
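A hypothetical usage sketch on synthetic data (sizes and cluster counts are arbitrary, and it assumes a numba version on which the helpers above compile; the first call also pays the JIT compilation cost, so time a second call for steady-state numbers):
import time
import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.rand(20000, 24)                  # synthetic stand-in data
labels_demo = rng.randint(0, 10, size=20000)  # arbitrary cluster assignments

scores = silhouette_samples_memory_saving(X_demo, labels_demo)  # includes JIT compilation
t0 = time.time()
scores = silhouette_samples_memory_saving(X_demo, labels_demo)  # steady-state run
print(scores.shape, scores.mean(), time.time() - t0)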
Answer 3:
The accepted answer is much better on memory than the official function: it goes from len(data)^2 to len(largest cluster)^2. If your clusters are large enough, that can still pose an issue. I wrote the following, which uses only ~len(data) memory, but it is terribly slow.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels


def silhouette_samples_newest(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    unique_labels = le.classes_
    check_number_of_labels(len(unique_labels), X.shape[0])
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # Mean distance from each point to the members of its own cluster
    # (the point's own zero distance is included in the mean).
    intra_clust_dists = np.array([
        np.linalg.norm(X[(labels == labels[i]), :] - point, axis=1).mean()
        for i, point in enumerate(X)
    ])

    # For each point, the smallest mean distance to any other cluster.
    inter_clust_dists = np.array([
        min(
            np.linalg.norm(X[(labels == label), :] - point, axis=1).mean()
            for label in unique_labels if label != labels[i]
        )
        for i, point in enumerate(X)
    ])

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples