I am currently working on clustering some big data, about 30k rows; the dissimilarity matrix is just too big for R to handle, and I think this is not purely a memory-size problem. Maybe there is some smart way to do this?
Answer 1:
If your data is so large that base R can't easily cope, then you have several options:
- Work on a machine with more RAM.
- Use a commercial product, e.g. Revolution Analytics, which supports working with larger data in R.
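To put a rough number on "too big": for 30k rows the dissimilarity matrix alone needs several gigabytes before any clustering starts, so methods that require the full matrix (hclust, pam, etc.) quickly exhaust a typical workstation. A quick back-of-the-envelope check in R:

n <- 30000                       # number of rows in your data
# dist() stores only the lower triangle, as 8-byte doubles
n * (n - 1) / 2 * 8 / 1024^3     # roughly 3.4 GiB for the dist object alone
# a full n x n matrix, e.g. as.matrix(dist(x)), roughly doubles that
n^2 * 8 / 1024^3                 # roughly 6.7 GiB

This is one reason partitioning methods that work on the raw data, such as the k-means example below, are attractive at this scale.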
Here is an example using RevoScaleR, the commercial package by Revolution. I use the dataset diamonds, part of ggplot2, since it contains 53K rows, i.e. a bit larger than your data. The example doesn't make much analytic sense, since I naively convert factors into numerics, but it illustrates the computation on a laptop:
library(ggplot2)
library(RevoScaleR)
artificial <- as.data.frame(sapply(diamonds, as.numeric))
clusters <- rxKmeans(~carat + cut + color + clarity + price,
                     data = artificial, numClusters = 6)
clusters$centers
This results in:
carat cut color clarity price
1 0.3873094 4.073170 3.294146 4.553910 932.6134
2 1.9338503 3.873151 4.285970 3.623935 16171.7006
3 1.0529018 3.655348 3.866056 3.135403 4897.1073
4 0.7298475 3.794888 3.486457 3.899821 2653.7674
5 1.2653675 3.879387 4.025984 4.065154 7777.0613
6 1.5808225 3.904489 4.066285 4.066285 11562.5788
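If plain k-means is all you need, note that base R's stats::kmeans() also works directly on the raw numeric data and never builds a dissimilarity matrix, so 30k rows is not a problem even without a commercial package. A minimal sketch on the same naively converted diamonds data (base_clusters is just my name for the result; only base R and ggplot2 are used here):

library(ggplot2)
# same naive factor-to-numeric conversion as above
artificial <- as.data.frame(sapply(diamonds, as.numeric))
# k-means on the raw columns; no dissimilarity matrix is ever formed
base_clusters <- kmeans(artificial[, c("carat", "cut", "color", "clarity", "price")],
                        centers = 6, nstart = 10)
base_clusters$centers

Since this is ordinary k-means on unscaled columns, price will dominate the distances; in practice you would likely scale() the columns first.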