I have a dataframe of ids and coordinates. I need to calculate the geographic distance between all my ids, drop the ones that are too far from each other, and then go on with my analysis.
I have 30k ids, which would generate a 30k x 30k matrix. Here is a sample:
latitude longitude id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215 244_2662
x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624,
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581,
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512,
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192,
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975",
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953",
"244_42245", "635_90928", "244_2662")), .Names = c("latitude",
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L,
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")
First I tried to go straight for it, using the geosphere package:
library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))
This does not work because of memory issues: Error: cannot allocate vector of size 15.4 Gb. My second attempt was to first generate all the pairwise combinations beforehand, then merge with the original data set to get the lats and lons, and then calculate the distances:
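For scale, the memory needed for the dense result alone can be estimated with simple arithmetic (a back-of-the-envelope check assuming 8-byte doubles; the 15.4 Gb in the error suggests distm also makes intermediate copies on top of this):

```r
# A dense n x n matrix of doubles needs n^2 * 8 bytes:
n <- 30000
n^2 * 8 / 2^30   # ~6.7 GiB for the final matrix alone
```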
dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
setDT(dis.long)  # merge() returns a data.frame; := needs a data.table
dis.long[ , dist_km2 := distGeo(cbind(longitude.x, latitude.x),
                                cbind(longitude.y, latitude.y)) / 1000]
However, expand.grid runs out of memory. This is strange to me, since the resulting object would be 900 million rows by 2 columns, and I already work with data sets far larger than that (e.g. 200 million rows by 50 columns).
Another observation: I already tried integer ids, such as new_id = seq(1L, 30000L, 1L), to see whether integers would solve it, but I get the same memory error when I try to expand.
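The same arithmetic hints at why expand.grid fails here even though larger tables are workable elsewhere (a rough estimate; the exact per-entry size depends on whether the columns end up as factors, characters, or integers, and construction via rep() needs temporary copies on top of the final object):

```r
n <- 30000
rows <- n^2          # 900,000,000 rows in the pairwise grid
rows * 4 / 2^30      # ~3.4 GiB per column of 4-byte factor/integer codes
rows * 8 / 2^30      # ~6.7 GiB per column of 8-byte character pointers or doubles
```

With two columns plus temporaries, and two subsequent merges that each copy the table, peak usage can easily blow past 16 GB.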
I am running the following configuration, on a desktop with 16 GB of RAM:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 geosphere_1.5-5 sp_1.2-5 haven_1.0.0
[7] stringr_1.2.0 data.table_1.10.4
Can anybody give me an idea of how to calculate these distances? And why can't I generate this particular expand.grid while being able to construct much bigger objects?
You do not need to compare all-vs-all: that includes self-comparisons and both directions of every pair, even though the distance A-B equals B-A. Therefore you should try combn instead of expand.grid.
Your data
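Applied to the sample from the question (a sketch using combn to index each unordered pair exactly once, with distGeo from geosphere as in the question; x2 is re-created here so the snippet runs standalone):

```r
library(geosphere)

# The 10-row sample data frame from the question
x2 <- data.frame(
  latitude  = c(-23.5247247, -23.6200954, -23.6163624, -23.5382557, -23.5826609,
                -23.4700519, -23.4923465, -23.5222581, -23.6485288, -23.4964047),
  longitude = c(-46.4778499, -46.6934512, -46.4814794, -46.4675563, -46.5449536,
                -46.7092093, -46.4934192, -46.7270957, -46.7223717, -46.6121477),
  id = c("917_62346", "244_42975", "302_75289", "917_96304", "302_84126",
         "233_97098", "917_62953", "244_42245", "635_90928", "244_2662"),
  stringsAsFactors = FALSE
)

# combn(n, 2) enumerates each unordered pair of row indices once
idx <- combn(nrow(x2), 2)

d <- data.frame(
  id1 = x2$id[idx[1, ]],
  id2 = x2$id[idx[2, ]],
  dist_km = distGeo(
    cbind(x2$longitude[idx[1, ]], x2$latitude[idx[1, ]]),
    cbind(x2$longitude[idx[2, ]], x2$latitude[idx[2, ]])
  ) / 1000
)

nrow(d)  # 45 pairs, versus 100 rows from expand.grid
```

From here you can filter d on dist_km before going on with the analysis.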
expand.grid vs combn comparison
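On 10 ids the difference is easy to verify directly (a minimal base-R comparison):

```r
n <- 10
nrow(expand.grid(seq_len(n), seq_len(n)))  # 100: ordered pairs, self-pairs included
ncol(combn(n, 2))                          # 45: each unordered pair once
choose(n, 2)                               # 45: same count, n * (n - 1) / 2
```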
Much larger example

With n = 30,000, the combn result is about half the size of the expand.grid result.
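At the full n = 30,000 the pair count just about halves, which is what brings the problem within reach of 16 GB (the index matrix is still sizable, so you may additionally want to generate and filter pairs in chunks; that chunking step is a suggestion, not something benchmarked here):

```r
n <- 30000
n^2                            # 900,000,000 rows via expand.grid
choose(n, 2)                   # 449,985,000 pairs via combn -- just under half
choose(n, 2) * 2 * 4 / 2^30    # ~3.4 GiB for the 2 x 449,985,000 integer index matrix
```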