How to calculate Euclidean distance (and save only

I've written a short 'for' loop to find the minimum Euclidean distance between each row in a data frame and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate a full distance matrix for a very large dataset. However, although very little is held in memory, it is very slow for large inputs (my use case of ~150K rows is still running).

I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).

library(proxy)

df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

min.dist <- function(df) {
  # df for results
  all.min.dist <- data.frame()
  # set up for loop
  for (k in 1:nrow(df)) {
    # calculate dissimilarity between each row and all other rows
    df.dist <- dist(df[k, ], df[-k, ])
    # find minimum distance
    min.dist <- min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    # combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row = row.names(df)[k],
                                     dist = min.dist,
                                     closest_row = closest.row))
  }
  # return results
  return(all.min.dist)
}

# example
min.dist(df)

Tags: r for-loop distance vectorization

2 Answers

虎瘦雄心在

#2 · 2019-02-25 11:37

Usually, built-in functions are faster than coding it yourself (because they are written in Fortran or C/C++ and optimized).

It seems that the function dist {stats} answers your question spot on:

Description: "This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix."
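As a minimal sketch of how that suggestion could be applied to the nearest-row problem: build the full distance matrix with stats::dist, mask the diagonal, and take the minimum of each row. The names d, nearest and result are only illustrative. Note that the full matrix grows with the square of the number of rows (for ~150K rows the dist object alone would need roughly 90 GB), so this is likely to suit only smaller inputs.

```r
# Small example data, same shape as in the question
set.seed(1)
df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

# Full pairwise Euclidean distance matrix; stats:: is spelled out
# because loading the proxy package masks dist()
d <- as.matrix(stats::dist(df))

# A row is always at distance 0 from itself, so exclude the diagonal
diag(d) <- Inf

# Column index of the nearest other row, for every row
nearest <- unname(apply(d, 1, which.min))

result <- data.frame(orig_row    = rownames(d),
                     dist        = d[cbind(seq_len(nrow(d)), nearest)],
                     closest_row = rownames(d)[nearest])
print(result)
```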


Ridiculous、

#3 · 2019-02-25 11:42

This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.

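min.dist <- function(df) {

  # x holds one point per column (the transposed data), so that
  # x[, -k] - x[, k] subtracts point k from every other point.
  which.closest <- function(k, x) {
    d <- colSums((x[, -k, drop = FALSE] - x[, k]) ^ 2)  # squared distances
    m <- which.min(d)
    # the point labels are the column names of the transposed matrix
    data.frame(orig_row    = colnames(x)[k],
               dist        = sqrt(unname(d[m])),
               closest_row = colnames(x)[-k][m])
  }

  # one call per row of df, i.e. per column of t(as.matrix(df))
  do.call(rbind, lapply(seq_len(nrow(df)), which.closest, t(as.matrix(df))))
}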

If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.
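One way the block-wise idea could look is sketched below. This is an illustration rather than a tested solution for 150K rows: the function name min.dist.chunked and the chunk.size argument are made up for the example, and the squared distances are obtained from the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so that each block reduces to a single matrix product.

```r
min.dist.chunked <- function(df, chunk.size = 1000) {
  x  <- as.matrix(df)
  n  <- nrow(x)
  sq <- rowSums(x^2)            # squared norm of every row, computed once
  best.dist <- numeric(n)
  best.idx  <- integer(n)

  for (start in seq(1, n, by = chunk.size)) {
    idx <- start:min(start + chunk.size - 1, n)
    # length(idx) x n block of squared distances
    d2 <- outer(sq[idx], sq, "+") -
      2 * tcrossprod(x[idx, , drop = FALSE], x)
    # exclude each point's distance to itself
    d2[cbind(seq_along(idx), idx)] <- Inf
    nearest <- unname(apply(d2, 1, which.min))
    best.idx[idx]  <- nearest
    # pmax guards against tiny negative values caused by rounding
    best.dist[idx] <- sqrt(pmax(d2[cbind(seq_along(idx), nearest)], 0))
  }

  data.frame(orig_row    = rownames(x),
             dist        = best.dist,
             closest_row = rownames(x)[best.idx])
}

# example on small random data
set.seed(42)
df  <- data.frame(matrix(runif(50 * 5), nrow = 50, ncol = 5),
                  row.names = paste("site", 1:50))
res <- min.dist.chunked(df, chunk.size = 7)
head(res)
```

Each block holds chunk.size x nrow(df) doubles, so a block of 1000 rows against 150K rows would take about 1.2 GB; a smaller chunk.size trades speed for memory.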

Edit: Also read https://stackoverflow.com/a/16670220/1201032

