Python: computing pairwise distances causes memory error

Posted 2019-05-19 03:46

Question:

I want to compute the pairwise distances of 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances:

from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist returns a condensed distance matrix: a 1-D numpy array with shape (57832*57831/2,).

However, this causes a memory error.

Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError

My laptop has 16 GB of RAM. How should I fix this? Or is there a better way?

Answer 1:

Matrix-based algorithms become prohibitive on data sets of this size.

The memory requirements are straightforward to estimate. Even when exploiting symmetry, many implementations max out at about 65000 instances, and for your 57832 vectors the condensed matrix alone already approaches your 16 GB of RAM. Even 64-bit implementations on big machines eventually give up: a 1000000 x 1000000 matrix in double precision needs 4 TB of RAM even when only half of it is stored.
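A quick back-of-the-envelope check (plain Python; the n comes from the question, nothing else is assumed) makes the numbers concrete:

n = 57832
n_pairs = n * (n - 1) // 2          # pdist stores one double (8 bytes) per pair
bytes_needed = n_pairs * 8
print(n_pairs)                      # 1672241196 distances
print(bytes_needed / 1024.0**3)     # ~12.5 GiB for the condensed matrix alone

That allocation alone leaves almost nothing of your 16 GB for X, the interpreter, and everything else in the process, which is exactly where the MemoryError comes from.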

Use better algorithms that don't need O(n^2) memory and runtime.
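If you still need per-point quantities that are defined over all pairwise cosine distances, you can at least compute them without ever holding the full condensed matrix. A minimal sketch, assuming scikit-learn is available: pairwise_distances_chunked streams blocks of rows whose size is bounded by working_memory, and a reduce_func keeps only what you need from each block. The reduction below is a cutoff-based neighbour count, in the spirit of the rho in your compute_distances_and_rhos; the cutoff value 0.4 is just a placeholder.

from sklearn.metrics import pairwise_distances_chunked
import numpy as np

D_C = 0.4  # placeholder cutoff distance; tune for your data

def rho_counts(D_chunk, start):
    # D_chunk is one block of rows of the (virtual) 57832 x 57832 distance matrix;
    # start is the index of the first row in this block (unused here).
    # Count neighbours within the cutoff; subtract 1 to exclude the point itself.
    return (D_chunk < D_C).sum(axis=1) - 1

# Only one block of rows (bounded by working_memory MiB) exists in memory at a time.
rho = np.concatenate(list(
    pairwise_distances_chunked(X, metric='cosine',
                               reduce_func=rho_counts,
                               working_memory=512)
))

The generator yields one reduced array per chunk, so peak memory stays bounded regardless of n; whether this pattern fits depends on what your clustering code actually does with the distances downstream.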