Kmeans on a million observations in R - trouble pl

Posted 2019-02-23 13:09

Question:

I am trying to perform k-means clustering on over a million rows with 4 variables, all numeric. I am using the following code:

kmeansdf<-as.data.frame(rbind(train$V3,train$V5,train$V8,train$length))
km<-kmeans(kmeansdf,2)

As can be seen, I would like to divide my data into two clusters. The object km is populated, but I am having trouble plotting the results. Here is the code I am using to plot:

plot(kmeansdf,col=km$cluster)

This piece of code gives me the following error:

Error in plot.new() : figure margins too large

I searched online but could not find a solution. I also tried running the code from the command line, but I still get the same error (I am using RStudio at the moment).

Any help to resolve the error would be highly appreciated. TIA.

Answer 1:

When I run your code on a df with 1e6 rows, I don't get the same error, but the system hangs (I interrupted it after 10 min). It may be that creating a scatterplot matrix with 1e6 points per panel is just too much.

You might consider taking a random sample:

# all this to create a df with two distinct clusters
set.seed(1)
center.1 <- c(2,2,2,2)
center.2 <- c(-2,-2,-2,-2)
n <- 5e5
f <- function(x){return(data.frame(V1=rnorm(n,mean=x[1]),
                                   V2=rnorm(n,mean=x[2]),
                                   V3=rnorm(n,mean=x[3]),
                                   V4=rnorm(n,mean=x[4])))}
df <- do.call("rbind",lapply(list(center.1,center.2),f))

km <- kmeans(df,2)         # run kmeans on full dataset
df$cluster <- km$cluster   # append cluster column to df

# sample is 10% of population (100,000 rows)
s  <- 1e5
df <- df[sample(nrow(df),s),]
plot(df[,1:4],col=df$cluster)
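
One more thing that may be worth ruling out (this is an assumption on my part, not something confirmed in the thread): rbind() stacks each vector as a row, so the kmeansdf in the question would have 4 rows and about a million columns rather than a million rows and 4 columns. plot() on a data frame with more than two columns draws a scatterplot matrix, and a grid with that many panels could plausibly produce "figure margins too large". The sketch below uses small made-up vectors (v3, v5, v8 and len are hypothetical stand-ins for the columns of train) to show the difference between rbind() and cbind():

```r
# Hypothetical stand-ins for train$V3, train$V5, train$V8, train$length
set.seed(1)
n   <- 1000
v3  <- rnorm(n)
v5  <- rnorm(n)
v8  <- rnorm(n)
len <- rnorm(n)

by_row <- as.data.frame(rbind(v3, v5, v8, len))  # each vector becomes a ROW
by_col <- as.data.frame(cbind(v3, v5, v8, len))  # each vector becomes a COLUMN

dim(by_row)  # 4 rows, n columns
dim(by_col)  # n rows, 4 columns

# kmeans on the column-wise layout assigns one cluster label per observation
km_col <- kmeans(by_col, 2)
length(km_col$cluster) == nrow(by_col)
```

If dim(kmeansdf) reports 4 rows on the real data, building the data frame with cbind() (or data.frame()) and then plotting a random sample as above would be the thing to try.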