Starting with latitude/longitude data (in radians), I’m trying to efficiently find the nearest n neighbors, ideally with geodesic (WGS-84) distance.
Right now I’m using sklearn’s BallTree with haversine distance (KD-Trees only accept Minkowski metrics), which is nice and fast (3-4 seconds to find the nearest 5 neighbors for 1,200 locations among 7,500 possible matches), but not as accurate as I need. Code:
from sklearn.neighbors import BallTree

# the haversine metric expects coordinates as (lat, lon) in radians
tree = BallTree(possible_matches[['x', 'y']], leaf_size=2, metric='haversine')
distances, indices = tree.query(locations[['x', 'y']], k=5)
When I substitute a custom function for the metric (metric=lambda u, v: geopy.distance.geodesic(u, v).miles) it takes an "unreasonably" long time (4 minutes for the same case as above). The documentation does note that custom metric functions can be slow, but that doesn't help me solve my problem.
I looked at using a KD-Tree with ECEF coordinates and Euclidean distance, but I’m not sure whether that’s actually any more accurate.
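For concreteness, the ECEF route I have in mind looks roughly like this (a sketch only: I'm assuming my 'x' column is latitude and 'y' is longitude, both in radians, and treating every point as sitting on the WGS-84 ellipsoid at height zero):

import numpy as np
from sklearn.neighbors import KDTree

def latlon_rad_to_ecef(lat, lon, height=0.0):
    # WGS-84 ellipsoid constants
    a = 6378137.0                  # semi-major axis (metres)
    f = 1 / 298.257223563          # flattening
    e2 = f * (2 - f)               # first eccentricity squared
    n = a / np.sqrt(1 - e2 * np.sin(lat) ** 2)   # prime vertical radius of curvature
    x = (n + height) * np.cos(lat) * np.cos(lon)
    y = (n + height) * np.cos(lat) * np.sin(lon)
    z = (n * (1 - e2) + height) * np.sin(lat)
    return np.column_stack([x, y, z])

matches_ecef = latlon_rad_to_ecef(possible_matches['x'].values, possible_matches['y'].values)
locations_ecef = latlon_rad_to_ecef(locations['x'].values, locations['y'].values)

tree = KDTree(matches_ecef, leaf_size=2)
# Euclidean distances here are straight-line chords (metres), not surface distances
distances, indices = tree.query(locations_ecef, k=5)

My understanding is that the Euclidean distance between ECEF points is the chord through the Earth, so it slightly underestimates the surface distance, although for nearby points the nearest-neighbor ordering should mostly agree.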
How can I keep the speed of my current method, but improve my distance accuracy?
The main reason your metric is slow is that it is written in Python, while the built-in metrics in sklearn are implemented in Cython/C++/C.
So, as discussed here for Random Forests or here, for example, you would have to implement your metric in Cython, fork your own version of BallTree, and include your custom metric there.
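As a quick sanity check, you can see which metrics get the fast compiled path by inspecting the list BallTree ships with; anything outside this list (in particular a Python callable like your geopy lambda) is evaluated pair by pair in Python:

from sklearn.neighbors import BallTree

# metrics in this list are implemented in Cython and run at C speed;
# a custom Python callable is invoked once per candidate pair instead
print(BallTree.valid_metrics)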