How to compute big geographic distance matrix

2019-08-23 05:06发布

问题:

I have a dataframe of ids and coordinates. I need to calculate the geographic distance between all my ids, drop the ones that are too far from each other, and then go on with my analysis.

I have 30k ids, which would generate a 30k x 30k matrix. Here is a sample:

 latitude longitude        id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215  244_2662

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

First I tried to go straight for it, using the geosphere package:

library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))

This does not work, because of memory issues, Error: cannot allocate vector of size 15.4 Gb. My second attempt was to first generate all the pairwise combinations beforehand, than merge with the original data set to get the lats and lons, and then calculate the distances, such as

dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
dis.long <- dis.long[ , dist_km2 := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                                        matrix(c(longitude.y, latitude.y), ncol = 2))/1000]

However, expand_grid runs out of memory. This is strange to me, since the resulting matrix would be 900mi rows by 2 cols, and I already deal with data sets way larger (like 200 mi x 50 matrices).

Another observation, I already tried using ids such as new_id = seq(1L,30000L,1L), to see whether integers would solve it, but I get the same memory issue when I try to expand.

I am currently under these configurations, besides a 16gb Ram desktop

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xlsx_0.5.7        xlsxjars_0.6.1    rJava_0.9-8       geosphere_1.5-5   sp_1.2-5          haven_1.0.0      
[7] stringr_1.2.0     data.table_1.10.4

Can anybody give me an idea of how to calculate these distances? And why I cant generate this specific expand.grid while being able to construct bigger objects?

回答1:

You do not need to compare all-vs-all, which includes self-comparison and directional comparison (A-B != B-A); therefore you should try combn instead of expand.grid

Your data

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

expand.grid

OP <- function(df) {
            x3 = expand.grid(df$id, df$id)
            Y <- merge(x3, df, by.x = "Var1", by.y = "id")
            Y <- merge(Y, df, by.x = "Var2", by.y = "id")
            return(Y)
      }

vs combn

CP <- function(df) {
            Did = as.data.frame(t(combn(df$id, 2)))
            Z <- merge(Did, df, by.x = "V1", by.y = "id")
            Z <- merge(Z, df, by.x = "V2", by.y = "id")
            return(Z)
      }

Comparison

dis.long <- OP(x2)
object.size(dis.long)
# 7320 bytes

new <- CP(x2)
object.size(new)
# 5016 bytes

Much larger example

num <- 5e2
bigx <- data.frame(latitude=rnorm(num)*-23, longitude=rnorm(num)*-46, id=1:num)

bigdl <- OP(bigx)
object.size(bigdl)
# 10001224 bytes

bignew <- CP(bigx)
object.size(bignew)
# 4991224 bytes

About half the size