I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:
1) How to efficiently calculate distance between pair of coordinates using data.table :=
2) Using a simple for loop on spatial data
3) Calculate distance between two latitude-longitude points? (Haversine formula)
4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]
I was able to construct code that works, but it is (too) slow. I was wondering if someone can help me with a faster data.table implementation, or can point out where the bottleneck in my code is. Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but I am open to other software (as long as it is open source).
Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.
# packages:
library(data.table)
library(tidyverse)
library(sp)
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata <- as.data.frame(cbind(id  = 1:number_of_GPS_coordinates,
                               lat = runif(number_of_GPS_coordinates, 50.5, 53.5),
                               lon = runif(number_of_GPS_coordinates, 4, 7)))
# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop <- as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep the id variable
stop$lat <- stop$lat + rnorm(number_of_bus_stops, 0, .0005)
stop$lon <- stop$lon + rnorm(number_of_bus_stops, 0, .0005)
busdata.data <- cbind(stop, name = replicate(number_of_bus_stops,
                                             paste(sample(LETTERS, 15, replace = TRUE), collapse = "")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
Download the real bus stop data if you want; it is kind of hard to reproduce with a random sample.
#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#busstopdata <- read.csv(gzfile(temp, 'rt'), stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
# mutate(latitude_bustops = latitude)%>%
# mutate(longitude_bustops = longitude)%>%
# dplyr::select(name, latitude_bustops, longitude_bustops)
The code I use now to calculate the distances. It works, but it is pretty slow:
countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[, c("lon", "lat")]),
                         p,
                         longlat = TRUE) # in km
  return(which(distances <= .2)) # distance is now set to 200 meters
}
# for every bus stop, find the GPS data points that are near it and save their row indices as a list entry
datapoints.by.bustation <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)
# rename list entries
names(datapoints.by.bustation) <- busdata.data$name
# melt list into one big data.frame
long.data.frame.busstops <- melt(datapoints.by.bustation)
# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := .I]
setkey(gpsdata, rowID)
setDT(long.data.frame.busstops)
# merge the data, and filter non-unique entries
setkey(long.data.frame.busstops, value)
GPS.joined <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # keep one row per GPS point (id)
# this last part of the code is needed to make sure that, if there is more than one bus stop nearby, those bus stops are put in a list
# instead of adding rows and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# makes sense:
as_tibble(GPS.joined.unique2) %>%
summarize(sum = sum(nearby))
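To see which of the steps above is the bottleneck, the individual stages can be wrapped in system.time() when running the pipeline; a minimal sketch, re-using only the objects defined above:
# time the distance step (spDistsN1() via apply()) and the melt() step separately
system.time(
  datapoints.by.bustation <- apply(data.matrix(busdata.data[, c("longitude_bustops", "latitude_bustops")]),
                                   1, countDataPoints3)
)
system.time(
  long.data.frame.busstops <- melt(datapoints.by.bustation)
)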
Consider a slicing method: first cut by close latitudes and close longitudes, in this case 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support of rolling joins. The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
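As an illustration of the bounding-box idea only (not the rolling-join code the answer refers to), here is a minimal sketch built on the example data above. It uses a data.table non-equi join as the candidate filter; the tolerances and the object names (tol_lat, tol_lon, gps, bus, cand) are assumptions of this sketch:
library(data.table)
library(sp)

# ~200 m expressed in degrees, made slightly generous on purpose;
# the exact distance check below removes any false positives
tol_lat <- 0.2 / 110
tol_lon <- 0.2 / (110 * cos(max(gpsdata$lat) * pi / 180))

gps <- as.data.table(gpsdata)
bus <- as.data.table(busdata.data)
bus[, `:=`(lat_lo = latitude_bustops  - tol_lat,
           lat_hi = latitude_bustops  + tol_lat,
           lon_lo = longitude_bustops - tol_lon,
           lon_hi = longitude_bustops + tol_lon)]

# candidate pairs: GPS points that fall inside a bus stop's bounding box
cand <- bus[gps,
            on = .(lat_lo <= lat, lat_hi >= lat, lon_lo <= lon, lon_hi >= lon),
            .(id = i.id, lat = i.lat, lon = i.lon,
              name, latitude_bustops, longitude_bustops),
            nomatch = NULL, allow.cartesian = TRUE]

# exact great-circle distance (in km), computed only for the candidate pairs
cand[, dist_km := spDistsN1(cbind(lon, lat),
                            c(longitude_bustops[1], latitude_bustops[1]),
                            longlat = TRUE),
     by = .(latitude_bustops, longitude_bustops)]

gps[, nearby := id %in% cand[dist_km <= 0.2, id]]
The pre-filter reduces the exact distance computations from roughly 20,000 x 2,000 pairs to only the handful of candidates per point, which is where the speed-up comes from.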
This is the function I came up with so far. @Dave2e, thanks. It is already an awful lot faster than what I had. There is still clearly room for a lot of improvement, but as it stands it is fast enough for my analysis now. I only slice by latitude and not by longitude. The only reason for that is that it makes indexing and then looping over the indices really easy, but more speed could be gained by also indexing by longitude. Also, in real GPS data there tend to be many duplicate values (same lon/lat, different timestamp); the code would be more efficient if it took this into account. Maybe I will work on that in the future.
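For reference, a minimal sketch of what such a latitude-only slicing function could look like (this is an illustration of the idea described above, not the exact function; the name nearby_by_lat_slice and its arguments are made up):
library(sp) # for spDistsN1()

nearby_by_lat_slice <- function(gpsdata, busdata.data,
                                slice_width = 0.5, # degrees of latitude per slice
                                radius_km   = 0.2) {
  gpsdata <- as.data.frame(gpsdata) # works for data.frame and data.table input
  margin  <- radius_km / 110        # ~200 m in degrees of latitude, slightly generous

  # assign every GPS point to a latitude slice
  breaks <- seq(floor(min(gpsdata$lat)), ceiling(max(gpsdata$lat)), by = slice_width)
  gpsdata$slice  <- findInterval(gpsdata$lat, breaks)
  gpsdata$nearby <- FALSE

  for (s in unique(gpsdata$slice)) {
    idx <- which(gpsdata$slice == s)
    pts <- data.matrix(gpsdata[idx, c("lon", "lat")])
    # bus stops in this slice, plus a margin so points near a slice edge are not missed
    lo <- breaks[s] - margin
    hi <- breaks[s] + slice_width + margin
    stops_s <- busdata.data[busdata.data$latitude_bustops >= lo &
                            busdata.data$latitude_bustops <= hi, ]
    if (nrow(stops_s) == 0) next
    # for every bus stop in the slice, flag the GPS points within radius_km
    for (j in seq_len(nrow(stops_s))) {
      d <- spDistsN1(pts,
                     c(stops_s$longitude_bustops[j], stops_s$latitude_bustops[j]),
                     longlat = TRUE)
      gpsdata$nearby[idx[d <= radius_km]] <- TRUE
    }
  }
  gpsdata
}

# example call on the data generated above:
# gps_flagged <- nearby_by_lat_slice(gpsdata, busdata.data)
# sum(gps_flagged$nearby)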