I am trying to cluster a matrix (size: 20057x2):
T = clusterdata(X,cutoff);
but I get this error:
??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.
Error in ==> pdist at 211
    Y = pdistmex(X',dist,additionalArg);
Error in ==> linkage at 139
    Z = linkagemex(Y,method,pdistArg);
Error in ==> clusterdata at 88
    Z = linkage(X,linkageargs{1},pdistargs);
Error in ==> kmeansTest at 2
    T = clusterdata(X,1);
Can someone help me? I have 4GB of RAM, but I think the problem comes from somewhere else.
PDIST calculates distances between all possible pairs of rows. If your data contain N=20057 rows, then the number of pairs will be N*(N-1)/2, which is 201131596 in your case. That might be too much for your machine.
X is too big to do on a 32 bit machine. pdist is trying to make a row vector of 201,131,596 doubles (clusterdata uses pdist), which would use up about 1609MB (a double is 8 bytes). If you run it under Windows with the /3GB switch you're limited to a maximum matrix size of 1536MB (see here). You're going to need to divide up the data somehow instead of directly clustering all of it in one go.
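To make the arithmetic concrete, here is a quick sketch of the estimate. It only reproduces the numbers quoted above (N rows, 8 bytes per double); the variable names are mine.

```matlab
N = 20057;                       % number of rows in X (from the question)
numPairs = N*(N-1)/2;            % length of the distance vector pdist has to build
bytesNeeded = numPairs * 8;      % each double takes 8 bytes
fprintf('%d pairs, about %.0f MB\n', numPairs, bytesNeeded/1e6);
```

Compare the result with the 1536MB limit mentioned above: the distance vector alone does not fit in a single array, regardless of how much physical RAM is installed.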
As mentioned by others, hierarchical clustering needs to calculate the pairwise distance matrix which is too big to fit in memory in your case.
Try using the K-Means algorithm instead:
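A minimal sketch, assuming you have the Statistics Toolbox and that you can pick a number of clusters up front. K = 5, the 'Replicates' setting, and the random stand-in data are my own placeholder choices, not something taken from the question.

```matlab
X = rand(20057, 2);      % stand-in data; replace with your own 20057x2 matrix
K = 5;                   % assumed number of clusters, choose what suits your data

% kmeans never builds the full pairwise distance vector, so memory use
% grows with N*K rather than N^2.
[idx, centers] = kmeans(X, K, 'Replicates', 5);

% idx(i) is the cluster index of row i, centers is a K-by-2 matrix of centroids
```

Note that unlike clusterdata with a cutoff, kmeans needs the number of clusters as an input, so the results are not directly comparable.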
Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, you compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, you simply compute its distance to each of the centroids and assign it to the closest one.
Here's a sample code to illustrate the idea above:
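This is a rough sketch of the three steps rather than a tuned implementation. The subset size (2000), the number of clusters (K = 5), the use of 'maxclust' instead of a cutoff, and the random stand-in data are all assumptions of mine that you should adapt.

```matlab
X = rand(20057, 2);              % stand-in data; replace with your own matrix
N = size(X, 1);

% 1) Hierarchical clustering on a random subset that is small enough for pdist
subsetSize = 2000;               % assumed: 2000*1999/2 distances, roughly 16 MB
perm = randperm(N);
subsetIdx = perm(1:subsetSize);
restIdx = perm(subsetSize+1:end);
Xsub = X(subsetIdx, :);
K = 5;                           % assumed number of clusters
Tsub = clusterdata(Xsub, 'maxclust', K);

% 2) Cluster centers as the mean of each cluster group
numClusters = max(Tsub);
centers = zeros(numClusters, size(X, 2));
for k = 1:numClusters
    centers(k, :) = mean(Xsub(Tsub == k, :), 1);
end

% 3) Assign every row outside the subset to its closest center
T = zeros(N, 1);
T(subsetIdx) = Tsub;             % subset rows keep their hierarchical labels
for i = restIdx
    d = sum(bsxfun(@minus, centers, X(i, :)).^2, 2);   % squared Euclidean distances
    [minDist, nearest] = min(d);
    T(i) = nearest;
end
```

The quality of the result depends on how representative the random subset is, so it is worth repeating the run with a few different subsets and comparing the assignments.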