I get a memory error when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.
For the related function, silhouette_score, this post suggests using the sample_size parameter, which reduces the sample size before calling silhouette_samples. I am not sure that the down-sampling would still produce reliable results, so I hesitate to do that.
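For reference, the down-sampled call would look something like this (using the X and labels from below; the sample_size and random_state values are arbitrary placeholders):
from sklearn.metrics import silhouette_score

# Down-sampling approach: silhouette_score picks a random subset of
# sample_size rows before computing the pairwise distances.
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=10000, random_state=0)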
My input, X, is a [107545 rows x 12 columns] dataframe, which I would not really consider to be big, although I only have 8 GB of RAM.
sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
167 check_number_of_labels(len(le.classes_), X.shape[0])
168
--> 169 distances = pairwise_distances(X, metric=metric, **kwds)
170 unique_labels = le.classes_
171 n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
The calculation seems to rely on euclidean_distances, which crashed on the call to np.dot. I am not dealing with sparsity here, so maybe there is no solution. When calculating distances I normally use numpy.linalg.norm(A-B). Does this have better memory handling?
Update: PR 11135 should resolve this issue within scikit-learn, making the rest of the post obsolete.
You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each is a floating-point number; the float64 format takes 8 bytes of memory, so the size of the distance matrix is 8e10 bytes, which is 74.5 GB.
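As a quick check of that figure, using the rounded 1e5 sample count:
n_samples = 100000              # rounded from 107545
n_bytes = n_samples ** 2 * 8    # one float64 per pairwise distance
print(n_bytes / 2 ** 30)        # -> ~74.5, the "74.5 GB" figure above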
This is occasionally reported on GitHub (#4701, #4197), with the answer being roughly: it's a NumPy problem that it can't handle np.dot with matrices of that size. Although there was one comment saying:
Indeed, if instead of forming one giant distance matrix at the beginning, the method computed relevant chunks of it in the loop over labels, that would require less memory.
It is not hard to modify the method using its source so that instead of computing distances first and applying binary masks later, it masks first. This is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size. If this is something that looks practical, I imagine it could be added to Scikit by way of some flag... one should note that this version does not support metric='precomputed', though.
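A minimal sketch of that masking-first idea (an illustration of the approach rather than the answer's exact code; the name silhouette_samples_memory_saving is made up here):
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    # Compute silhouette values cluster by cluster, so the largest distance
    # block ever held in memory is (largest cluster) x (largest cluster)
    # instead of the full N x N matrix.
    X = np.asarray(X)
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)
    masks = {k: labels == k for k in unique_labels}

    intra = np.zeros(X.shape[0])          # a(i): mean distance within own cluster
    inter = np.full(X.shape[0], np.inf)   # b(i): smallest mean distance to another cluster
    for k in unique_labels:
        Xk = X[masks[k]]
        nk = Xk.shape[0]
        if nk > 1:
            d = pairwise_distances(Xk, metric=metric, **kwds)
            intra[masks[k]] = d.sum(axis=1) / (nk - 1)   # self-distance is 0
        for j in unique_labels:
            if j == k:
                continue
            # Cross-cluster block; computed for both (k, j) and (j, k) for simplicity.
            d = pairwise_distances(Xk, X[masks[j]], metric=metric, **kwds)
            inter[masks[k]] = np.minimum(inter[masks[k]], d.mean(axis=1))

    denom = np.maximum(intra, inter)
    denom[denom == 0] = 1.0               # avoid 0/0 for duplicate points
    sil = (inter - intra) / denom
    for k in unique_labels:
        if masks[k].sum() < 2:
            sil[masks[k]] = 0.0           # scikit-learn convention for singleton clusters
    return sil
Peak memory is then driven by the largest cluster rather than by the whole dataset.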
The accepted answer is much better on memory than the official function: it goes from len(data)^2 to len(cluster)^2. If your clusters are large enough, this can still pose an issue. I wrote the following, which is ~len(data) in memory, but it is terribly slow.
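A sketch of that kind of ~len(data)-memory, row-at-a-time computation (illustrative only, not the code originally posted; silhouette_samples_low_memory is a made-up name):
import numpy as np

def silhouette_samples_low_memory(X, labels):
    # One sample at a time: only a single row of distances (length N) is
    # ever in memory, at the cost of a slow Python-level loop.
    X = np.asarray(X, dtype=np.float64)
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)
    sil = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))   # euclidean distances from sample i
        own = labels[i]
        n_own = (labels == own).sum()
        if n_own < 2:
            continue                                  # singleton cluster -> silhouette 0
        a = d[labels == own].sum() / (n_own - 1)      # excludes the zero self-distance
        b = min(d[labels == k].mean() for k in unique_labels if k != own)
        sil[i] = (b - a) / max(a, b)
    return sil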
I developed a memory-efficient and relatively fast solution for the euclidean distance case using numba. It works with roughly constant memory relative to the size of the input data and uses numba's automatic parallelization. With it I processed a dataset of 300000 rows in 24 dimensions (which would have needed about 720 GB of RAM for the full distance matrix). It can be modified to implement other distance metrics as needed.
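A sketch of such a numba-based version (illustrative names, not the original answer's code; it assumes a euclidean metric and encodes the labels as integers outside the jitted core):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def _silhouette_core(X, codes, n_clusters):
    # Row-at-a-time accumulation: for each sample only per-cluster distance
    # sums are kept, so memory stays roughly constant; prange parallelizes
    # the outer loop across cores.
    n, n_features = X.shape
    counts = np.zeros(n_clusters, dtype=np.int64)
    for i in range(n):
        counts[codes[i]] += 1
    sil = np.zeros(n)
    for i in prange(n):
        sums = np.zeros(n_clusters)
        for j in range(n):
            d = 0.0
            for f in range(n_features):
                diff = X[i, f] - X[j, f]
                d += diff * diff
            sums[codes[j]] += np.sqrt(d)
        own = codes[i]
        if counts[own] > 1:
            a = sums[own] / (counts[own] - 1)     # self-distance is 0
            b = np.inf
            for c in range(n_clusters):
                if c != own and counts[c] > 0:
                    m = sums[c] / counts[c]
                    if m < b:
                        b = m
            sil[i] = (b - a) / max(a, b)
        # samples in singleton clusters keep silhouette 0
    return sil

def silhouette_samples_numba(X, labels):
    # Wrapper: encode arbitrary labels as 0..n_clusters-1 before nopython mode.
    X = np.ascontiguousarray(np.asarray(X, dtype=np.float64))
    uniq, codes = np.unique(np.asarray(labels), return_inverse=True)
    return _silhouette_core(X, codes.astype(np.int64), len(uniq))
Note that the first call includes numba's compilation time; subsequent calls run at full speed.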