I wonder if anyone could have a look at the following code and minimal example and suggest improvements - in particular regarding efficiency of the code when working with really large data sets.
The function takes a data.frame and splits it by a grouping variable (factor) and then calculates the distance matrix for all the rows in each group.
I do not need to keep the distance matrices - only some statistics ie the mean, the histogram .., then they can be discarded.
I don't know much about memory allocation and the like and am wondering what would be the best way to do this, since I will be working with 10.000 - 100.000 of cases per group. Any thoughts will be greatly appreciated!
Also, what would be the least painful way of including bigmemory or some other large data handling package into the function as is in case I run into serious memory issues?
FactorDistances <- function(df) {
# df is the data frame where the first column is the grouping variable.
# find names and number of groups in df (in the example there are three:(2,3,4)
factor.names <- unique(df[1])
n.factors <-length(unique(df$factor))
# split df by factor into list - each subset dataframe is one list element
df.l<-list()
for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
# use lapply to go through list and calculate distance matrix for each group
# this results in a new list where each element is a distance matrix
distances <- lapply (df.l, function(x) dist(x[,2:length(x)], method="minkowski", p=2))
# again use lapply to get the mean distance for each group
means <- lapply (distances, mean)
rm(distances)
gc()
return(means)
}
df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result are three average euclidean distances between all pairs in each group
# If a group has only one member, the value is NaN
Edit: I edited the title to reflect the chunking issue I posted as an answer..
I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:
"Error in double(N * (N - 1)/2) : vector size specified is too large"
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"
The function calculates the mean distance for the matrix, but you can change that to anything else, but in case you want to actually save the matrix I believe some sort of filebacked bigmemory matrix is in order.. Kudos to link for the idea and Ari for his help!
FunDistanceMatrixChunking <- function (df, blockSize=100){
n <- nrow(df)
blocks <- n %/% blockSize
if((n %% blockSize) > 0)blocks <- blocks + 1
chunk.means <- matrix(NA, nrow=blocks*(blocks+1)/2, ncol= 2)
dex <- 1:blockSize
chunk <- 0
for(i in 1:blocks){
p <- dex + (i-1)*blockSize
lex <- (blockSize+1):(2*blockSize)
lex <- lex[p<= n]
p <- p[p<= n]
for(j in 1:blocks){
q <- dex +(j-1)*blockSize
q <- q[q<=n]
if (i == j) {
chunk <- chunk+1
x <- dist(df[p,])
chunk.means[chunk,] <- c(length(x), mean(x))}
if ( i > j) {
chunk <- chunk+1
x <- as.matrix(dist(df[c(q,p),]))[lex,dex]
chunk.means[chunk,] <- c(length(x), mean(x))}
}
}
mean <- weighted.mean(chunk.means[,2], chunk.means[,1])
return(mean)
}
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)
Not sure whether I should have posted this as an edit, instead of an answer.. It does solve my problem, although I didn't really specify it this way..
A few thoughts:
unique(df[1])
probably works (by ignoring the data.frame property of your list), but makes me nervous and is hard to read. unique(df[,1])
would be better.
for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
can be done with split
.
- If you're worried about memory, definitely don't store the entire distance matrix for every level, then calculate your summary statistics for every factor level! Change your lapply to something like:
lapply (df.l, function(x) mean(dist(x[,2:length(x)], method="minkowski", p=2)))
.
If you need more than one summary statistic, calculate both and return a list:
lapply (df.l, function(x) {
dmat <- dist(x[,2:length(x)], method="minkowski", p=2)
list( mean=mean(dmat), median=median(dmat) )
})
See if that gets you anywhere. If not, you may have to go more specialized (avoiding lapply
, storing your data.frames as matrices instead, etc.)