I have a numpy array of 3 million points in the form of [pt_id, x, y, z]. The goal is to return all pairs of points whose Euclidean distance lies between two numbers, min_d and max_d.
The Euclidean distance is computed on x and y only, not on z. However, I'd like to preserve the array with pt_id_from, pt_id_to, distance attributes.
I'm using scipy's pdist to calculate the distances:
import numpy as np
import scipy.spatial.distance
coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
['pt2', 2479539.000, 7287455.000, 4.900],
['pt3', 2479626.000, 7287458.000, 10.000],
['pt4', 2484097.000, 7292784.000, 8.800],
['pt5', 2484106.000, 7293079.000, 7.300],
['pt6', 2484095.000, 7292891.000, 11.100]])
dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
np.savetxt('test.out', scipy.spatial.distance.squareform(dists), delimiter=',')
What should I do to return an array of the form [pt_id_from, pt_id_to, distance]?
Well, ['pt1', 'pt2', distance_as_number] is not exactly possible. The closest you can get with mixed datatypes is a structured array, but then you can't do things like result[:2,0]. You'll have to index field names and array indices separately, like result[['a','b']][0].
Here is my solution:
The structured array:
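A minimal sketch of what a structured-array result could look like, not necessarily the code this answer originally used. It assumes the ids are kept in a separate string array (at most 8 characters each) so the coordinates stay numeric, and the thresholds min_d and max_d are example values chosen here for illustration. The field names follow the question.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Ids and x/y coordinates from the question, held in separate arrays.
ids = np.array(['pt1', 'pt2', 'pt3', 'pt4', 'pt5', 'pt6'])
xy = np.array([[2452130.0, 7278106.0],
               [2479539.0, 7287455.0],
               [2479626.0, 7287458.0],
               [2484097.0, 7292784.0],
               [2484106.0, 7293079.0],
               [2484095.0, 7292891.0]])

min_d, max_d = 50.0, 300.0  # assumed example thresholds, not from the question

dists = pdist(xy, 'euclidean')          # condensed distances, one per pair i < j
i, j = np.triu_indices(len(ids), k=1)   # pair indices in the same order as dists
mask = (dists >= min_d) & (dists <= max_d)

# One record per matching pair: two string fields and one float field.
result = np.empty(mask.sum(), dtype=[('pt_id_from', 'U8'),
                                     ('pt_id_to', 'U8'),
                                     ('distance', 'f8')])
result['pt_id_from'] = ids[i[mask]]
result['pt_id_to'] = ids[j[mask]]
result['distance'] = dists[mask]
```

Fields and rows are then indexed separately, for example result['distance'][:2] or result[['pt_id_from', 'pt_id_to']][0].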
The ndarray:
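One way to get a plain two-dimensional ndarray that still mixes strings and numbers is an object-dtype array. This is a sketch under that assumption rather than the answer's original code; min_d and max_d are again example values picked for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Ids and x/y coordinates from the question, held in separate arrays.
ids = np.array(['pt1', 'pt2', 'pt3', 'pt4', 'pt5', 'pt6'])
xy = np.array([[2452130.0, 7278106.0],
               [2479539.0, 7287455.0],
               [2479626.0, 7287458.0],
               [2484097.0, 7292784.0],
               [2484106.0, 7293079.0],
               [2484095.0, 7292891.0]])

min_d, max_d = 50.0, 300.0  # assumed example thresholds, not from the question

dists = pdist(xy, 'euclidean')
i, j = np.triu_indices(len(ids), k=1)   # same pair order as the condensed dists
mask = (dists >= min_d) & (dists <= max_d)

# dtype=object lets each column hold a different Python type.
pairs = np.empty((mask.sum(), 3), dtype=object)
pairs[:, 0] = ids[i[mask]]    # pt_id_from
pairs[:, 1] = ids[j[mask]]    # pt_id_to
pairs[:, 2] = dists[mask]     # distance
```

Here ordinary slicing such as pairs[:2, 0] works, at the cost of losing vectorised numeric operations on the distance column until it is cast back with pairs[:, 2].astype(float).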
You can use np.where to get the coordinates of the distances within a range, then generate a new list in your format, filtering out repeated pairs.
You simply create a new array from the data by looping through all the possible combinations. The itertools module is excellent for this.
If memory is a problem, you might want to change the distance lookup from using the huge matrix D to looking up the value directly in dists using the i and j indices.
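The two ideas above can be sketched without ever building the square matrix, since pdist returns its condensed vector in the same pair order that itertools.combinations and np.triu_indices(n, k=1) produce. The thresholds min_d and max_d below are example values assumed for illustration, and the ids are kept in a separate list so the coordinates stay numeric.

```python
import itertools
import numpy as np
from scipy.spatial.distance import pdist

ids = ['pt1', 'pt2', 'pt3', 'pt4', 'pt5', 'pt6']
xy = np.array([[2452130.0, 7278106.0],
               [2479539.0, 7287455.0],
               [2479626.0, 7287458.0],
               [2484097.0, 7292784.0],
               [2484106.0, 7293079.0],
               [2484095.0, 7292891.0]])

min_d, max_d = 50.0, 300.0  # assumed example thresholds, not from the question

dists = pdist(xy, 'euclidean')  # condensed: one distance per pair with i < j

# itertools variant: walk the (i, j) combinations alongside dists directly.
index_pairs = itertools.combinations(range(len(ids)), 2)
pairs_loop = [(ids[i], ids[j], d)
              for (i, j), d in zip(index_pairs, dists)
              if min_d <= d <= max_d]

# np.where variant: find matching positions in dists, then map each position
# back to its (i, j) pair. Each pair appears only once in the condensed
# vector, so no duplicate filtering is needed.
hits = np.where((dists >= min_d) & (dists <= max_d))[0]
i_all, j_all = np.triu_indices(len(ids), k=1)
pairs_where = [(ids[i_all[k]], ids[j_all[k]], dists[k]) for k in hits]
```

Both variants still need the full condensed vector, which grows as n(n-1)/2. For 3 million points that is roughly 4.5e12 distances, so this illustrates the indexing on small inputs rather than a solution at that scale.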