Distance matrix of curves in Python

Posted 2019-01-23 08:46

Question:

I have a set of curves, each defined as a 2D array of shape (number of points, number of coordinates). I am calculating a distance matrix for them using the Hausdorff distance. My current code is below. Unfortunately, it is too slow with 500-600 curves, each having 50-100 3D points. Is there a faster way to do this?

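import numpy as np
import scipy.spatial.distance

def distanceBetweenCurves(C1, C2):
    # Pairwise Euclidean distances between the points of the two curves
    D = scipy.spatial.distance.cdist(C1, C2, 'euclidean')

    # Directed (non-symmetric) Hausdorff distances
    H1 = np.max(np.min(D, axis=1))
    H2 = np.max(np.min(D, axis=0))

    # Symmetrize by averaging the two directed distances
    return (H1 + H2) / 2.

def distanceMatrixOfCurves(Curves):
    numC = len(Curves)

    D = np.zeros((numC, numC))
    # Fill only the upper triangle and mirror it; the diagonal stays zero
    for i in range(0, numC-1):
        for j in range(i+1, numC):
            D[i, j] = D[j, i] = distanceBetweenCurves(Curves[i], Curves[j])

    return D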

Answer 1:

Your question might also be related to this one

This is a fairly hard problem. One option would be to implement the Euclidean distance yourself, abandon SciPy entirely, and rely on PyPy's JIT compiler. Most likely, though, this will not gain you much.

Personally, I would recommend writing the routine in C.

The problem is less the implementation than the way you approach it. You chose a brute-force approach: computing the Euclidean distance for every pair of points in every possible pair of subsets of the metric space. This is computationally demanding:

  • Assume you have 500 curves and each of them has 75 points. Counting every ordered pair of curves, the brute-force approach computes the Euclidean distance 500 * 499 * 75 * 75 = 1 403 437 500 times; exploiting symmetry, as your code does, halves that to about 700 million. It is not surprising that this approach takes forever to run.
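The count in the bullet above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name and the `symmetric` flag are made up for illustration and are not part of the original code.

```python
def pairwise_distance_evaluations(num_curves, points_per_curve, symmetric=True):
    """Number of point-to-point distances a brute-force curve distance matrix needs.

    Assumes every curve has the same number of points (a simplification).
    """
    ordered_pairs = num_curves * (num_curves - 1)
    # With a symmetric distance only the pairs with i < j have to be computed.
    curve_pairs = ordered_pairs // 2 if symmetric else ordered_pairs
    return curve_pairs * points_per_curve ** 2


print(pairwise_distance_evaluations(500, 75, symmetric=False))  # 1403437500
print(pairwise_distance_evaluations(500, 75))                   # 701718750
```

Either way the work grows with the square of both the number of curves and the number of points per curve.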

I'm not an expert on this, but I know that the Hausdorff distance is used extensively in image processing. I would suggest browsing the literature for speed-optimized algorithms. A starting point might be this, or this paper. The Voronoi diagram is also often mentioned in combination with the Hausdorff distance.

I hope these links might help you with this problem.
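As one concrete instance of a speed-optimized algorithm (not necessarily the one from the papers linked in this answer), SciPy ships `scipy.spatial.distance.directed_hausdorff`, which uses an early-break strategy instead of building the full point-to-point distance matrix. A minimal sketch of the question's averaged distance on top of it follows; the function name `distance_between_curves` is an assumption for illustration.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def distance_between_curves(c1, c2):
    """Average of the two directed Hausdorff distances, as in the question.

    c1 and c2 are arrays of shape (number of points, number of coordinates).
    """
    # directed_hausdorff returns (distance, index_in_first, index_in_second).
    h1 = directed_hausdorff(c1, c2)[0]
    h2 = directed_hausdorff(c2, c1)[0]
    return (h1 + h2) / 2.0
```

Whether this actually beats the `cdist` version for curves with only 50-100 points would have to be measured; the early break tends to pay off more on larger point sets.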



Answer 2:

I recently replied to a similar question here: Hausdorff distance between 3D grids

I hope this helps. I was faced with 25 x 25,000 points in a pairwise comparison (25 x 25 x 25,000 points in total), and my code runs from 1 minute up to 3-4 hours (depending on the number of points). I don't see many options, mathematically, for gaining speed.

Alternatives would be to use a different programming language (C / C++) or to move this calculation to the GPU (CUDA). I am playing with the CUDA approach right now.

Edit on 03/12/2015:

I was able to speed up this comparison with parallel CPU-based computation. That was the quickest way to go. I used the nice example from the pp package (Parallel Python) and ran it on three different computer and Python combinations. Unfortunately, I kept getting memory errors with 32-bit Python 2.7, so I installed 64-bit WinPython 2.7 and some experimental 64-bit numpy packages.
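The pp-based code itself is not shown here. As a swapped-in illustration of the same idea, the sketch below spreads the curve pairs over a pool from the standard-library `concurrent.futures` module. The names `mean_hausdorff` and `distance_matrix_parallel`, and the `chunksize` default, are assumptions for illustration only.

```python
from concurrent.futures import Executor
from itertools import combinations

import numpy as np
from scipy.spatial.distance import cdist


def mean_hausdorff(pair):
    """Average of the two directed Hausdorff distances, as in the question."""
    c1, c2 = pair
    d = cdist(c1, c2, "euclidean")
    return (d.min(axis=1).max() + d.min(axis=0).max()) / 2.0


def distance_matrix_parallel(curves, executor: Executor, chunksize=256):
    """Fill the symmetric distance matrix by mapping curve pairs over `executor`.

    `executor` is any concurrent.futures executor supplied by the caller,
    for example a ProcessPoolExecutor created under `if __name__ == "__main__":`.
    """
    n = len(curves)
    index_pairs = list(combinations(range(n), 2))  # all (i, j) with i < j
    curve_pairs = ((curves[i], curves[j]) for i, j in index_pairs)
    out = np.zeros((n, n))
    # Executor.map yields results in the same order as its inputs.
    results = executor.map(mean_hausdorff, curve_pairs, chunksize=chunksize)
    for (i, j), value in zip(index_pairs, results):
        out[i, j] = out[j, i] = value
    return out
```

With a process pool each pair of curves is pickled and sent to a worker, so batching many pairs per task (`chunksize`) matters; whether the speed-up outweighs that overhead for small curves needs to be measured.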

So for me this effort was quite helpful, and it was not as complicated as CUDA. Good luck!



Answer 3:

There are several methods you can try:

  1. Using numpy-MKL, a NumPy build that uses Intel's high-performance Math Kernel Library, instead of stock NumPy;
  2. Using Bottleneck for array functions;
  3. Using Cython for the computation.