Question:
I get a memory error when trying to call sklearn.metrics.silhouette_samples. My use case is identical to this tutorial. I am using scikit-learn 0.18.1 in Python 3.5.
For the related function silhouette_score, this post suggests using the sample_size parameter, which down-samples the data before the silhouette computation. I am not sure that down-sampling would still produce reliable results, so I hesitate to do that.
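For reference, the down-sampled call would look something like the sketch below (the arrays here are random stand-ins for my data; sample_size and random_state are parameters of silhouette_score):
import numpy as np
from sklearn.metrics import silhouette_score

# Stand-ins for my data: ~100k rows in 12 dimensions plus precomputed cluster labels.
X = np.random.rand(107545, 12)
labels = np.random.randint(0, 5, size=len(X))

# silhouette_score draws sample_size points at random and scores only that subset,
# so the distance matrix stays at sample_size x sample_size (10000 x 10000 here,
# roughly 0.8 GB of float64 instead of the full matrix).
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=10000, random_state=42)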
My input, X, is a [107545 rows x 12 columns] DataFrame, which I would not really consider big, although I do only have 8 GB of RAM.
sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
167 check_number_of_labels(len(le.classes_), X.shape[0])
168
--> 169 distances = pairwise_distances(X, metric=metric, **kwds)
170 unique_labels = le.classes_
171 n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
The calculation seems to rely on euclidean_distances, which crashed on the call to np.dot. I am not dealing with sparsity here, so maybe there is no solution. When calculating distances I normally use numpy.linalg.norm(A-B). Does this have better memory handling?
Answer 1:
Update: PR 11135 should resolve this issue within scikit-learn, making the rest of the post obsolete.
You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each is a floating point number; the float64 format takes 8 bytes of memory. So the size of the distance matrix is 8e10 bytes, which is about 74.5 GB.
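The same estimate in code, using both the exact row count from the question and the rounded 1e5 figure:
n_exact = 107545                      # rows in the question's DataFrame
print(n_exact ** 2 * 8 / 2 ** 30)     # ~86.2 GiB for the exact count
print(int(1e5) ** 2 * 8 / 2 ** 30)    # ~74.5 GiB for the rounded 1e5 figure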
This has been reported on GitHub before (#4701, #4197), with the answer being roughly that it is a NumPy limitation: it cannot handle np.dot with matrices of that size. Although there was one comment saying that
it might be possible to break this up into sub-matrices to do the calculation more memory efficient.
Indeed, if instead of forming one giant distance matrix at the beginning, the method computed relevant chunks of it in the loop over labels, that would require less memory.
It is not hard to modify the method, starting from its source, so that instead of computing the distances first and applying binary masks later, it masks first. This is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size.
If this is something that looks practical, I imagine it could be added to scikit-learn by way of some flag. Note, however, that this version does not support metric='precomputed'.
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels


def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find intra_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
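A quick way to sanity-check this against the built-in implementation, on data small enough for both to fit in memory (the synthetic KMeans setup below is only for illustration; on the question's scikit-learn version the two results should agree up to floating-point noise):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.RandomState(0)
X_small = rng.rand(500, 12)
labels_small = KMeans(n_clusters=4, random_state=0).fit_predict(X_small)

# Both compute the per-sample silhouette coefficient; only the memory profile differs.
expected = silhouette_samples(X_small, labels_small)
actual = silhouette_samples_memory_saving(X_small, labels_small)
print(np.allclose(expected, actual))  # expected: True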
Answer 2:
I developed a memory-efficient and relatively fast solution for the euclidean distance case using numba. It works in roughly constant memory relative to the size of the input data and uses numba's automatic parallelization. With it I was able to handle a dataset of 300000 rows in 24 dimensions (whose full pairwise-distance matrix would have needed about 720 GB of RAM). It can be modified to implement other distance metrics as needed.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels
from numba import jit


@jit(nogil=True, parallel=True)
def euclidean_distances_numba(X, Y=None, Y_norm_squared=None):
    # disable checks
    XX_ = (X * X).sum(axis=1)
    XX = XX_.reshape((1, -1))

    if X is Y:  # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = Y_norm_squared
    else:
        YY_ = np.sum(Y * Y, axis=1)
        YY = YY_.reshape((1, -1))

    distances = np.dot(X, Y.T)
    distances *= -2
    distances += XX
    distances += YY
    distances = np.maximum(distances, 0)
    return np.sqrt(distances)


@jit(parallel=True)
def euclidean_distances_sum(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    sums = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        sums[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).sum()
    return sums


@jit(parallel=True)
def euclidean_distances_mean(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    means = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        means[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y, Y_norm_squared=Y_norm_squared).mean()
    return means


def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find intra_clust_dist for all samples belonging to the same label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_clust_dists[mask] = euclidean_distances_sum(X[mask, :]) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                other_distances = euclidean_distances_mean(X[mask, :], X[other_mask, :])
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
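A hypothetical usage sketch on synthetic data (sizes and cluster counts are arbitrary, and it assumes a numba version on which the helpers above compile; the first call also pays the JIT compilation cost, so time a second call for steady-state numbers):
import time
import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.rand(20000, 24)                  # synthetic stand-in data
labels_demo = rng.randint(0, 10, size=20000)  # arbitrary cluster assignments

scores = silhouette_samples_memory_saving(X_demo, labels_demo)  # includes JIT compilation
t0 = time.time()
scores = silhouette_samples_memory_saving(X_demo, labels_demo)  # steady-state run
print(scores.shape, scores.mean(), time.time() - t0)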
Answer 3:
The accepted answer is much better on memory than the official function: it goes from len(data)^2 to len(largest cluster)^2. If your clusters are large enough, that can still pose an issue. I wrote the following, which uses only ~len(data) memory, but it is terribly slow.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels


def silhouette_samples_newest(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    unique_labels = le.classes_
    check_number_of_labels(len(unique_labels), X.shape[0])
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # Mean distance from each point to the members of its own cluster
    # (the point's own zero distance is included in the mean).
    intra_clust_dists = np.array([
        np.linalg.norm(X[(labels == labels[i]), :] - point, axis=1).mean()
        for i, point in enumerate(X)
    ])

    # For each point, the smallest mean distance to any other cluster.
    inter_clust_dists = np.array([
        min(
            np.linalg.norm(X[(labels == label), :] - point, axis=1).mean()
            for label in unique_labels if label != labels[i]
        )
        for i, point in enumerate(X)
    ])

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples