Efficient matching of two arrays (how to use KDTre

2019-02-19 10:35发布

问题:

I have two 2d arrays, obs1 and obs2. They represent two independent measurement series, and both have dim0 = 2, and slightly different dim1, say obs1.shape = (2, 250000), and obs2.shape = (2, 250050). obs1[0] and obs2[0] signify time, and obs1[1] and obs2[1] signify some spatial coordinate. Both arrays are (more or less) sorted by time. The times and coordinates should be identical between the two measurement series, but in reality they aren't. Also, not each measurement from obs1 has a corresponding value in obs2 and vice-versa. Another problem is that there might be a slight offset in the times.

I'm looking for an efficient algorithm to associate the best matching value from obs2 to each measurement in obs1. Currently, I do it like this:

define dt = some_maximum_time_difference
define dx = 3
j = 0
i = 0
matchresults = np.empty(obs1.shape[1])
for j in obs1.shape[1]:
    while obs1[0, j] - obs2[0, j] < dt:
        i += 1
    matchresults[j] = i - dx + argmin(abs(obs1[1, i] - obs2[1, i-dx:i+dx+1]))

This yields good results. However, it is extremely slow, running in a loop.

I would be very thankful for ideas on how to improve this algorithm speed-wise, e.g. using KDtree or something similar.

回答1:

Using cKDTree for this case would look like:

from scipy.spatial import cKDTree

obs2 = array with shape (2, m)
obs1 = array with shape (2, n)

kdt = cKDTree(obs2.T)
dist, indices = kdt.query(obs1.T)

where indices will contain the column indices in obs2 corresponding to each observation in obs1. Note that I had to transpose obs1 and obs2.