I am working with latitude/longitude data. I have to make clusters based on the distance between points. The distance between two points is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371
I want to use k-means in R. Is there any way I can override the distance calculation in that process?
K-means is not distance based
It is based on variance minimization. The sum of variances happens to equal the sum of squared Euclidean distances, but the converse does not hold for other distances: minimizing them is not the same as minimizing variance.
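To make the point above concrete, here is a minimal sketch on made-up toy data (the two blobs and the seed are assumptions for illustration only). It checks that the objective reported by kmeans() is the sum of squared Euclidean distances from each point to its cluster mean.

```r
set.seed(1)
# Hypothetical toy data: two 2-D blobs of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(x, centers = 2, nstart = 5)

# Sum of squared Euclidean distances from each point to its assigned center
ssq <- sum((x - fit$centers[fit$cluster, ])^2)

# Compare with the objective that kmeans() reports
all.equal(ssq, fit$tot.withinss)
```

Swapping in another distance here would change ssq, but kmeans() would still update the centers as means, which is why the two no longer fit together.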
If you want a k-means-like algorithm for other distances (where the mean is not an appropriate estimator), use k-medoids (PAM). In contrast to k-means, k-medoids will converge with arbitrary distance functions!
For Manhattan distance, you can also use k-medians. The median is an appropriate estimator for L1 norms (the median minimizes the sum of absolute differences; the mean minimizes the sum of squared differences).
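A small numeric check of the mean/median statement, as a sketch only: the vector v and the candidate grid are made up for illustration. It searches a grid of candidate centers for the one with the lowest cost under each criterion.

```r
v <- c(1, 2, 3, 4, 100)   # hypothetical values with one outlier

sum_sq  <- function(m) sum((v - m)^2)    # squared-error cost
sum_abs <- function(m) sum(abs(v - m))   # absolute-error (L1) cost

# Candidate centers on a grid spanning the data
candidates <- seq(min(v), max(v), by = 0.5)
best_sq  <- candidates[which.min(sapply(candidates, sum_sq))]
best_abs <- candidates[which.min(sapply(candidates, sum_abs))]

c(best_sq = best_sq, mean = mean(v))
c(best_abs = best_abs, median = median(v))
```

The squared-error minimizer is pulled toward the outlier, while the L1 minimizer stays with the bulk of the data.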
For your particular use case, you could also transform your data into 3D space, then use (squared) Euclidean distance and thus k-means. But your cluster centers will be somewhere underground!
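A rough sketch of the 3D approach, assuming a data frame df with lat and long columns in decimal degrees (the coordinates below are invented). Points are mapped to unit vectors, so the Euclidean (chord) distance grows with the great-circle distance and plain kmeans() can be used.

```r
# Hypothetical coordinates in decimal degrees
df <- data.frame(lat  = c(48.85, 48.86, 40.71, 40.73, 35.68, 35.69),
                 long = c(2.35, 2.29, -74.00, -73.99, 139.69, 139.70))

rad <- pi / 180
# Project onto the unit sphere: each point becomes an (x, y, z) vector
xyz <- cbind(x = cos(df$lat * rad) * cos(df$long * rad),
             y = cos(df$lat * rad) * sin(df$long * rad),
             z = sin(df$lat * rad))

set.seed(42)
fit <- kmeans(xyz, centers = 3, nstart = 10)

# Centers are means of unit vectors, so they lie inside the sphere
center_len <- sqrt(rowSums(fit$centers^2))
center_len

# Optional: push each center back to the surface and convert to lat/long
ctr <- fit$centers / center_len
data.frame(lat  = asin(ctr[, "z"]) / rad,
           long = atan2(ctr[, "y"], ctr[, "x"]) / rad)
```

The lengths below 1 are the "underground" effect; rescaling to length 1 is one simple way to report a surface location for each cluster.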
If you have a data.frame, df, with columns for lat and long, then you should be able to use the earth.dist(...) function in the fossil package to calculate a distance matrix, and pass that to pam(...) in the cluster package to do the clustering.
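library(fossil)   # provides earth.dist()
library(cluster)  # provides pam()
# Column order matters: longitude first, then latitude (decimal degrees)
df <- data.frame(long = <longitudes>, lat = <latitudes>)
d <- earth.dist(df, dist = TRUE)   # pairwise distances as a "dist" object
clust <- pam(d, k, diss = TRUE)    # k is the number of clusters you want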
See the documentation for earth.dist(...) and pam(...).
Use the following function to calculate the distance over the earth's surface; it doesn't need a function from an existing R package. I found it on Stack Overflow but can't remember the link to the post. However, I've validated it against GPS cumulative distance calculations and the results agree.
# Haversine distance in km between two points given in decimal degrees
earthDist <- function (lon1, lat1, lon2, lat2){
  rad <- pi/180          # degrees to radians
  a1 <- lat1 * rad
  a2 <- lon1 * rad
  b1 <- lat2 * rad
  b2 <- lon2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145          # earth radius in km
  d <- R * c
  return(d)
}
Call it from the following function, which adds up the distances between consecutive rows:
CalculateCumaltiveDist <- function(x, y, id) {
  # x = longitudes, y = latitudes, id = row identifiers (all the same length)
  n <- length(x)
  # Starting value is 0, because the first row is home
  km <- numeric(n)
  # Distance between each row and the row before it
  for (i in seq_len(n)[-1]) {
    km[i] <- earthDist(x[i - 1], y[i - 1], x[i], y[i])
  }
  # Per-row results, kept so the data frame can be returned instead of the sum
  tmp_All <- data.frame(id = id, x = x, y = y, km = km)
  return(sum(tmp_All$km, na.rm = TRUE))
}
If you want the data frame, return tmp_All instead of the final sum.
This gives you the distance between each observation and the one before it in the data frame.
If you want pairwise distances, apply the earth distance function to every pair of observations (obs[1:200000] against obs[1:200000]) until all pairwise combinations are calculated. Then arrange the results in a matrix and you have a distance matrix.
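As a sketch of the pairwise step, the explicit loop can be replaced with outer(). The helper haversine_km is a hypothetical name; it repeats the formula from earthDist so the snippet runs on its own, and the three coordinates are invented.

```r
# Haversine distance in km (same formula as earthDist; works on vectors)
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6378.145) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a))
}

# Hypothetical coordinates in decimal degrees
lon <- c(2.35, -74.00, 139.69)
lat <- c(48.85, 40.71, 35.68)

idx <- seq_along(lon)
# outer() evaluates the function for every (i, j) pair of row indices
dmat <- outer(idx, idx,
              function(i, j) haversine_km(lon[i], lat[i], lon[j], lat[j]))

# as.dist(dmat) could then be passed to cluster::pam() with diss = TRUE
dmat
```

Keep in mind that a full matrix for 200,000 observations has 4e10 entries, so at that size you would probably need to sample or pre-aggregate the points first.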
Hope this answers your question.