Clustering using a custom distance metric for lat/

Posted 2019-04-06 20:05

I'm trying to specify a custom distance function for the scikit-learn DBSCAN implementation:

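import numpy as np
from geopy.distance import vincenty  # geopy < 2.0; newer releases provide geodesic instead
from sklearn.cluster import DBSCAN

def geodistance(latLngA, latLngB):
    # Debug output: show the two points handed to the metric
    print(latLngA, latLngB)
    return vincenty(latLngA, latLngB).miles

cluster_labels = DBSCAN(
    eps=500,  # miles, matching the unit returned by geodistance
    min_samples=max(2, len(found_geopoints) // 10),  # integer division keeps this an int
    metric=geodistance
).fit(np.array(found_geopoints)).labels_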

However, when I print out the arguments to my distance function they aren't at all what I would expect:

[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]
[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]

This is what my found_geopoints array looks like:

[[  4.24680600e+01   1.40868060e+02]
 [ -2.97677600e+01  -6.20477000e+01]
 [  3.97550400e+01   2.90069000e+00]
 [  4.21144200e+01   1.43442500e+01]
 [  8.56111000e+00   1.24771390e+02]
...

So why aren't the arguments to the distance function latitude/longitude pairs?

2 Answers
forever°为你锁心
Reply #2 · 2019-04-06 20:38

You can do this with scikit-learn: use the haversine metric with the ball-tree algorithm, and pass the coordinates in radians to DBSCAN's fit method.

This tutorial demonstrates how to cluster spatial lat-long data with scikit-learn's DBSCAN using the haversine metric to cluster based on accurate geodetic distances between lat-long points:

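import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.read_csv('gps.csv')
coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in pandas 1.0
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))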

Notice that the coordinates are passed to the .fit() method in radians, and that the epsilon parameter must be expressed in radians as well.

If you want epsilon to be, say, 1.5 km, then the epsilon parameter value in radians is 1.5/6371, where 6371 km is the Earth's mean radius.
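To make the unit conversion concrete, here is a minimal sketch of the approach described above. The helper km_to_radians and the four sample points are made up for illustration (they are not from the tutorial), and min_samples=2 is just an assumed value:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same constant used in the text


def km_to_radians(km):
    """Hypothetical helper: convert a surface distance in km to an angle in radians."""
    return km / EARTH_RADIUS_KM


# Illustrative (lat, lon) points in degrees: three roughly 0.56 km apart
# along a meridian, plus one about 110 km further north.
coords = np.array([
    [40.000, -3.0],
    [40.005, -3.0],
    [40.010, -3.0],
    [41.000, -3.0],
])

eps = km_to_radians(1.5)  # the 1.5 km radius from the text, in radians

# The haversine metric expects [lat, lon] in radians, so convert before fitting.
db = DBSCAN(eps=eps, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coords))
labels = db.labels_

print(labels)
```

With these made-up points the three nearby locations should end up in one cluster and the distant one should be labelled as noise (-1). To read a haversine distance back in kilometres, multiply the radian value by 6371.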

爷的心禁止访问
Reply #3 · 2019-04-06 20:43

I seem to have found a workaround: compute a distance matrix with pairwise_distances (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html), then pass it to DBSCAN(metric='precomputed').fit(distance_matrix).
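A rough sketch of that workaround, under a few assumptions: haversine_miles is a hypothetical stand-in for the geopy-based distance in the question (so the snippet does not depend on geopy), the mean Earth radius is taken as 3958.8 miles, and eps=700 with min_samples=2 are illustrative values rather than the ones from the question. The points are the first five rows of found_geopoints shown there.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius in miles


def haversine_miles(a, b):
    """Hypothetical stand-in metric: great-circle distance in miles
    between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([a[0], a[1], b[0], b[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(h))


# First five rows of found_geopoints from the question, as (lat, lon) in degrees.
points = np.array([
    [42.46806, 140.86806],
    [-29.76776, -62.0477],
    [39.75504, 2.90069],
    [42.11442, 14.34425],
    [8.56111, 124.77139],
])

# pairwise_distances calls the metric once per pair of rows,
# so the function only ever sees real (lat, lon) pairs.
distance_matrix = pairwise_distances(points, metric=haversine_miles)

# With metric='precomputed', DBSCAN treats its input as distances, not features.
labels = DBSCAN(eps=700, min_samples=2,
                metric='precomputed').fit(distance_matrix).labels_

print(labels)
```

The third and fourth points are roughly 620 miles apart, so with the assumed eps of 700 miles they should form one cluster while the other three are labelled as noise (-1). Keep in mind that the full matrix needs memory proportional to the square of the number of points.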
