I have 44,000 US Zip codes and it's corresponding centroid lat/long in R. This is from the package 'zipcode' in R. I need to calculate the distance between each zipcode and keep those distances that are less than 5 miles. The problem is to calculate all distances between the zipcodes I have to create a vector of size 44,000x44,0000 which I can't due to space issues.
I checked through the posts in R, the closest to my requirement is one that spits out the minimum distance between 2 datasets with lat/long
DB1 <- data.frame(location_id=1:7000,LATITUDE=runif(7000,min = -90,max = 90),LONGITUDE=runif(7000,min = -180,max = 180))
DB2 <- data.frame(location_id=7001:12000,LATITUDE=runif(5000,min = -90,max = 90),LONGITUDE=runif(5000,min = -180,max = 180))
DistFun <- function(ID){
TMP <- DB1[DB1$location_id==ID,]
TMP1 <- distGeo(TMP[,3:2],DB2[,3:2])
TMP2 <- data.frame(DB1ID=ID,DB2ID=DB2[which.min(TMP1),1],DistanceBetween=min(TMP1) )
print(ID)
return(TMP2)
}
DistanceMatrix <- rbind_all(lapply(DB1$location_id, DistFun))
Even if we can modify the above code to incorporate all distances <= 5 miles (for eg), it is extremely slow in execution.
Is there an efficient way to arrive at all zip code combinations that are <=5 miles from each others centroids?
I guess the most efficient algorithms would first translate the spatial locations to a tree-like data structure. You don't need to do this explicitly though, if you have an algorithm that can 1) bin lat/longs to a spatial index, 2) tell you neighbors of that index, then you can use it to filter your square data. (This will be less efficient than building a tree, but probably easier to implement.)
geohash is such an algorithm. It turns continuous lat/long into 2-d bins. There is a (quite new) package providing geohash in R. Here's one idea of how you could use it for this problem:
First, with geohash do some preliminary calibration:
translate lat/long to a hash with bin precision
p
(say)assess whether the hash is calibrated at a precision similar to the distances you're interested in (say, 3-7 miles between neighbor centroids), if not return to 1 and adjust the precision
p
this yields a zipcode-hash value relationship.
Then, compute distances for each (unique) hash value
determine its (8, bc hashes form a 2-d grid) nearest-neighbors and so select 9 hash values
calculate pairwise distances among all zips within the 9 hashes (using, e.g,
distGeo
as in the question)return all zip-zip pairwise distances for the hash value (e.g., in a matrix)
this yields a hash value-zip-zip distance object relationship
(In step 2 it'd clearly be optimal to only calculate each nearest-neighbor pair once. But this might not be necessary.)
Finally, for each zip
distance object for the zip
< 5 miles
this yields a zip-zips within 5 miles object. (the zips within 5 miles of the focal zip could be stored as a column of lists (each element is a list) in a dataframe next to a column of focal zips, or as a separate list with focal zips as names).
Generating the whole distance matrix at a time will be very RAM consuming, looping over each combination of unique zipcodes - very time consuming. Lets find some compromise.
I suggest chunking the
zipcode
data.frame
into pieces of (for example) 100 rows (with the help ofchunk
function from packagebit
), then calculating distances between 44336 and 100 points, filtering according to the target distance treshold and then moving on to the next data chunk. In my example I convertzipcode
data intodata.table
to gain some speed and save RAM.Just for information - that's the approximate size of each piece of data in RAM.
The code itself.
On my machine it took 1.7 minutes to process 10 chunks, so the whole processing may take 70-80 minutes, not fast, but may be satisfying. We can increase the chunk size to 200 or 300 rows depending on available RAM volume, this will shorten the processing time 2 or 3 times respectively.
The drawback of this solution is that the resulting
data.table
contains "duplicated" rows - I mean there are both distances from point A to point B, and from B to A. This may need some additional filtering.