how to use different distance formula other than e

Posted 2020-07-22 19:15

Question:

I am working with latitude/longitude data and have to build clusters based on the distance between points. The distance between two points is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371

I want to use k-means in R. Is there any way I can override the distance calculation in that process?

Answer 1:

K-means is not distance-based

It is based on variance minimization. Minimizing the within-cluster variance is the same as minimizing the sum of squared Euclidean distances to the cluster mean, but that equivalence does not carry over to other distances.

If you want a k-means-like algorithm for other distances (where the mean is not an appropriate estimator), use k-medoids (PAM). In contrast to k-means, k-medoids will converge with arbitrary distance functions!

For Manhattan distance, you can also use k-medians. The median is an appropriate estimator for the L1 norm (the median minimizes the sum of absolute differences; the mean minimizes the sum of squared distances).
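The estimator claim can be checked numerically. The following is only a minimal sketch: the sample values in x are made up for illustration, and optimize() is used simply to search for the center that minimizes each cost.

```r
# Made-up one-dimensional sample (an assumption, for illustration only)
x <- c(1, 2, 3, 4, 100)

# Cost of representing all points by a single center m
sum_sq  <- function(m) sum((x - m)^2)    # squared Euclidean cost (k-means)
sum_abs <- function(m) sum(abs(x - m))   # L1 / Manhattan cost (k-medians)

# Search numerically for the center that minimizes each cost
best_sq  <- optimize(sum_sq,  interval = range(x))$minimum
best_abs <- optimize(sum_abs, interval = range(x))$minimum

c(best_sq = best_sq, mean = mean(x))         # the two values should agree closely
c(best_abs = best_abs, median = median(x))   # the two values should agree closely
```

The outlier at 100 pulls the squared-cost minimizer (the mean) far away from the bulk of the data, while the L1 minimizer (the median) stays with it.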

For your particular use case, you could also transform your data into 3D space, then use (squared) Euclidean distance and thus k-means. But your cluster centers will be somewhere underground!
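A rough sketch of that 3D approach, under stated assumptions: the helper names to_xyz and to_latlon are hypothetical, the sample coordinates are made up, and the Earth is treated as a unit sphere. Straight-line (chord) distance between two surface points grows monotonically with their great-circle distance, which is why Euclidean k-means on the 3D points is a reasonable stand-in. Because the centers land below the surface, the sketch projects them back onto the sphere before reporting them.

```r
# Convert latitude/longitude in decimal degrees to points on the unit sphere
to_xyz <- function(lat, lon) {
  lat <- lat * pi / 180
  lon <- lon * pi / 180
  cbind(x = cos(lat) * cos(lon),
        y = cos(lat) * sin(lon),
        z = sin(lat))
}

# Project 3D cluster centers back to the surface; returns lat/lon in degrees
to_latlon <- function(xyz) {
  xyz <- xyz / sqrt(rowSums(xyz^2))   # rescale each center (row) to unit length
  cbind(lat = asin(xyz[, "z"]) * 180 / pi,
        lon = atan2(xyz[, "y"], xyz[, "x"]) * 180 / pi)
}

# Made-up sample: three points near Paris, three near New York
pts <- data.frame(lat = c(48.85, 48.90, 48.80, 40.71, 40.75, 40.68),
                  lon = c(2.35, 2.30, 2.40, -74.00, -73.98, -74.05))

set.seed(1)
fit <- kmeans(to_xyz(pts$lat, pts$lon), centers = 2, nstart = 10)

fit$cluster              # cluster label for each point
to_latlon(fit$centers)   # cluster centers projected back onto the surface
```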



Answer 2:

If you have a data.frame, df, with columns for lat and long, then you should be able to use the earth.dist(...) function in the fossil package to calculate a distance matrix, and pass that to pam(...) in the cluster package to do the clustering.

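library(fossil)
library(cluster)
# replace the <...> placeholders with your own vectors; k is the number of clusters you want
df    <- data.frame(long=<longitudes>, lat=<latitudes>)
dist  <- earth.dist(df, dist=TRUE)
clust <- pam(dist, k, diss=TRUE)

See the documentation for earth.dist(...) and pam(...).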



Answer 3:

Use the following function to calculate the distance over the Earth's surface; it doesn't need an existing R package. I found this function on Stack Overflow, but I can't remember the link to the post. However, I've validated it against GPS cumulative distance calculations, and the results agree.

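earthDist <- function (lon1, lat1, lon2, lat2){
  # Haversine great-circle distance in km; inputs are in decimal degrees
  rad <- pi/180          # degrees-to-radians factor
  a1 <- lat1 * rad       # latitude of point 1 (radians)
  a2 <- lon1 * rad       # longitude of point 1 (radians)
  b1 <- lat2 * rad       # latitude of point 2 (radians)
  b2 <- lon2 * rad       # longitude of point 2 (radians)
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))   # central angle in radians
  R <- 6378.145          # Earth radius in km used by this function
  d <- R * c
  return(d)
}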

Call it from the following function:

CalculateCumaltiveDist <- function(x, y, id) {
    # x = longitudes, y = latitudes (decimal degrees), id = row identifiers

    # Use the length of the inputs rather than a global data frame
    n <- length(x)

    # Starting value is 0, because the first observation is home
    km <- numeric(n)

    # Distance between each observation and the one before it
    for (i in seq_len(n)[-1]) {
        km[i] <- earthDist(x[i - 1], y[i - 1], x[i], y[i])
    }

    # Collect the per-step distances alongside the inputs
    tmp_All <- data.frame(id = id, x = x, y = y, km = km)

    return(sum(tmp_All$km, na.rm = TRUE))
}

If you want the data frame instead, return tmp_All in place of the final sum.

This gives you the distance between each observation and the one before it in the data frame.

If you want pairwise distances instead, apply the earth distance function to every combination of observations, for example obs[1:200000] against obs[1:200000], until all pairs are calculated. Arrange the results into a matrix and you have a distance matrix.

Hope this answers your question.
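One way the pairwise step might look without an explicit loop is sketched below. The helper haversine_km is a hypothetical, compact restatement of the same haversine formula as earthDist (its name and radius argument are assumptions), and the three sample coordinates are made up. It relies on the formula being vectorized, so outer() can evaluate every pair of rows in one call.

```r
# Haversine distance in km; same formula as earthDist, written compactly.
# The name haversine_km and the radius argument are illustrative assumptions.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6378.145) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * radius * atan2(sqrt(a), sqrt(1 - a))
}

# Made-up sample coordinates in decimal degrees
pts <- data.frame(lon = c(2.35, -0.13, -74.00),
                  lat = c(48.85, 51.51, 40.71))

# outer() evaluates the function on every (i, j) pair of row indices at once
idx  <- seq_len(nrow(pts))
dmat <- outer(idx, idx, function(i, j) {
  haversine_km(pts$lon[i], pts$lat[i], pts$lon[j], pts$lat[j])
})

# Convert to a "dist" object, the form accepted by hclust() or cluster::pam()
d <- as.dist(dmat)
d
```

Note that a full matrix grows quadratically: 200,000 observations would mean 4 × 10^10 entries, so this is only practical for much smaller samples.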