I am attempting to run fastcluster on a very large set of distances, but I am running into a problem.
I have a very large CSV file (about 91 million rows, so a for loop takes too long in R) of similarities between keywords (about 50,000 unique keywords). When I read it into a data.frame, it looks like this:
> df
kwd1 kwd2 similarity
a b 1
b a 1
c a 2
a c 2
It is a sparse list and I can convert it into a sparse matrix using sparseMatrix():
> myMatrix
a b c
a . . .
b 1 . .
c 2 . .
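A minimal sketch of that conversion (assuming the Matrix package, with the keyword columns as character vectors; my exact call may differ slightly) is:

library(Matrix)
# map keywords to integer indices and build a sparse similarity matrix
kwds <- sort(unique(c(df$kwd1, df$kwd2)))
myMatrix <- sparseMatrix(i = match(df$kwd1, kwds),
                         j = match(df$kwd2, kwds),
                         x = df$similarity,
                         dims = c(length(kwds), length(kwds)),
                         dimnames = list(kwds, kwds))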
However, when I attempt to turn it into a dist object using as.dist(), R gives a 'problem too large' error. I have read the other dist questions on here, but the code suggested there does not work for the example data set above.
Thanks for any help!
While using a sparse matrix in the first place seems like a good idea, I think there is a bit of a problem with that approach: your missing distances will be coded as 0s, not as NAs (see Creating (and Accessing) a Sparse Matrix with NA default entries). As you know, when clustering, a zero dissimilarity has a totally different meaning than a missing one...
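To make that concrete, here is a small sketch with a hand-built toy version of your matrix above: the b/c pair is never observed, yet it comes back as a dissimilarity of exactly 0.

library(Matrix)
# both directions of each observed pair, as in your data.frame
m <- sparseMatrix(i = c(2, 1, 3, 1), j = c(1, 2, 1, 3), x = c(1, 1, 2, 2),
                  dims = c(3, 3), dimnames = list(letters[1:3], letters[1:3]))
as.matrix(as.dist(as.matrix(m)))
#   a b c
# a 0 1 2
# b 1 0 0   <- b/c was never observed, but shows up as a zero dissimilarity
# c 2 0 0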
So anyway, what you need is a dist object with a lot of NAs for your missing dissimilarities. Unfortunately, your problem is so big that it would require too much memory:
d <- dist(x = rep(NA_integer_, 50000))
# Error: cannot allocate vector of size 9.3 Gb
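That figure is simply the lower triangle of a 50,000 x 50,000 matrix stored as 8-byte doubles:

n <- 50000
n * (n - 1) / 2 * 8 / 1024^3  # size of the dist vector, in GiB
# [1] 9.31304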
And that's only dealing with the input... Even on a 64-bit machine with a lot of memory, I'm not sure the clustering algorithm itself wouldn't choke or run indefinitely.
You should consider breaking your problem into smaller pieces.