I want to compute the pairwise distances between 57832 vectors, each with 200 dimensions. I am using scipy's pdist to compute them:
from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist is supposed to return a condensed numpy array with shape (57832 * 57831 // 2,).
However, this raises a MemoryError:
Traceback (most recent call last):
File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
result_clustering = clf_clustering.getCVResult(shuffle)
File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
self.centroids_of_categories(X_train, y_train)
File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
distance_dict, rho_dict = self.compute_distances_and_rhos(X)
File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
pairwise_distances = pdist(X, 'cosine')
File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
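For scale, the size of the array that pdist tries to allocate can be estimated directly from the `np.zeros` line in the traceback. The following is a rough sketch of that arithmetic; it only counts the output array and assumes nothing else about pdist's internal memory use.

```python
import numpy as np

m = 57832                       # number of vectors, from the question
n_pairs = m * (m - 1) // 2      # length of the condensed distance array
bytes_f64 = n_pairs * np.dtype(np.double).itemsize  # 8 bytes per distance

print(n_pairs)                  # number of pairwise distances
print(bytes_f64 / 1e9)          # output size in GB (decimal)
```

That is about 1.67 billion distances, or roughly 13.4 GB as float64, before counting X itself or anything else the process holds.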
My laptop has 16 GB of RAM. How can I fix this, or is there a better way to compute these distances?
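One possible direction, sketched here under a strong assumption: if the full distance array is not needed at once and only a per-row summary is required, the distances can be computed block by block with `scipy.spatial.distance.cdist` and reduced immediately. The function name `neighbor_counts_chunked`, the `cutoff` parameter, and the `chunk_size` default are hypothetical and chosen only for illustration; the summary computed here (a count of neighbours within a cutoff) is an assumption, not something the code above is known to need.

```python
import numpy as np
from scipy.spatial.distance import cdist

def neighbor_counts_chunked(X, cutoff, chunk_size=1000):
    """For each row of X, count the other rows within `cutoff` cosine distance.

    Hypothetical helper: only one (chunk_size, m) block of distances is
    held in memory at a time, instead of all m * (m - 1) / 2 distances.
    """
    m = X.shape[0]
    counts = np.zeros(m, dtype=np.int64)
    for start in range(0, m, chunk_size):
        stop = min(start + chunk_size, m)
        # Distances from this block of rows to every row: shape (stop - start, m)
        block = cdist(X[start:stop], X, 'cosine')
        within = block < cutoff
        # Do not count a row as its own neighbour
        within[np.arange(stop - start), np.arange(start, stop)] = False
        counts[start:stop] = within.sum(axis=1)
    return counts
```

With 57832 rows and `chunk_size=1000`, each block holds 1000 × 57832 float64 values, roughly 0.46 GB. The trade-off is that every distance is computed twice, since the symmetry that pdist exploits is not used here.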