Find all point pairs closer than a given maximum d

2019-07-15 16:28发布

站内文章 / Python

31 0

姐就是有狂的资本

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I want to find (efficiently) all pairs of points that are closer than some distance max_d. My current method, using cdist, is:

import numpy as np
from scipy.spatial.distance import cdist

def close_pairs(X,max_d):
    d = cdist(X,X)

    I,J = (d<max_d).nonzero()
    IJ  = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal element
    IJ  = IJ[:,np.diff(IJ,axis=0).ravel()<>0]

    # remove duplicate
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def test():
    X = np.random.rand(100,2)*20
    p = close_pairs(X,2)

    from matplotlib import pyplot as plt
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')

But I think this is overkill (and not very readable), because most of the work is done only to remove distance-to-self and duplicates.

My main question is: is there a better way to do it?

(Note: the type of outputs (array, set, ...) is not important at this point)

My current thinking is on using pdist which returns a condensed distance array which contains only the right pairs. However, once I found the suitable coordinates k's from the condensed distance array, how do I compute which i,j pairs it is equivalent to?

So the alternative question is: is there an easy way to get the list of coordinate pairs relative to the entries of pdist outputs:

a function f(k)->i,j
such that cdist(X,X)[i,j] = pdist(X)[k]

回答1:

In my experience, there are two fastest ways to find neighbor lists in 3D. One is to use a most naive double-for-loop code written in C++ or Cython (in my case, both). It runs in N^2, but is very fast for small systems. The other way is to use a linear time algorithm. Scipy ckdtree is a good choice, but has limitations. Neighbor list finders from molecular dynamics software are most powerful, but are very hard to wrap, and likely have slow initialization time.

Below I compare four methods:

Naive cython code
Wrapper around OpenMM (is very hard to install, see below)
Scipy.spatial.ckdtree
scipy.spatial.distance.pdist

Test setup: n points scattered in a rectangular box at volume density 0.2. System size ranging from 10 to a 1000000 (a million) particles. Contact radius is taken from 0.5, 1, 2, 4, 7, 10. Note that because density is 0.2, at contact radius 0.5 we'll have on average about 0.1 contacts per particle, at 1 = 0.8, at 2 = 6.4, and at 10 - about 800! Contact finding was repeated several times for small systems, done once for systems >30k particles. If time per call exceeded 5 seconds, the run was aborted.

Setup: dual xeon 2687Wv3, 128GB RAM, Ubuntu 14.04, python 2.7.11, scipy 0.16.0, numpy 1.10.1. None of the code was using parallel optimizations (except for OpenMM, though parallel part went so quick that it was not even noticeable on a CPU graph, most of the time was spend piping data to-from OpenMM).

Results: Note that plots below are logscale, and spread over 6 orders of magnitude. Even small visual difference may be actually 10-fold. For systems less than 1000 particles, Cython code was always faster. However, after 1000 particles results are dependent on the contact radius. pdist implementation was always slower than cython, and takes much more memory, because it explicitly creates a distance matrix, which is slow because of sqrt.

At small contact radius (<1 contact per particle), ckdtree is a good choice for all system sizes.
At medium contact radius, (5-50 contacts per particle) naive cython implementation is the best up to 10000 particles, then OpenMM starts to win by about several orders of magnitude, but ckdtree performs just 3-10 times worse
At high contact radius (>200 contacts per particle) naive methods work up to 100k or 1M particles, then OpenMM may win.

Installing OpenMM is very tricky; you can read more in http://bitbucket.org/mirnylab/openmm-polymer file "contactmaps.py" or in the readme. However, the results below show that it is only advantageous for 5-50 contacts per particle, for N>100k particles.

Cython code below:

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<vector>" namespace "std":
    cdef cppclass vector[T]:
        cppclass iterator:
            T operator*()
            iterator operator++()
            bint operator==(iterator)
            bint operator!=(iterator)
        vector()
        void push_back(T&)
        T& operator[](int)
        T& at(int)
        iterator begin()
        iterator end()

np.import_array() # initialize C API to call PyArray_SimpleNewFromData
cdef public api tonumpyarray(int* data, long long size) with gil:
    if not (data and size >= 0): raise ValueError
    cdef np.npy_intp dims = size
    #NOTE: it doesn't take ownership of `data`. You must free `data` yourself
    return np.PyArray_SimpleNewFromData(1, &dims, np.NPY_INT, <void*>data)

@cython.boundscheck(False)
@cython.wraparound(False)
def contactsCython(inArray, cutoff):
    inArray = np.asarray(inArray, dtype = np.float64, order = "C")
    cdef int N = len(inArray)
    cdef np.ndarray[np.double_t, ndim = 2] data = inArray
    cdef int j,i
    cdef double curdist
    cdef double cutoff2 = cutoff * cutoff  # IMPORTANT to avoid slow sqrt calculation
    cdef vector[int] contacts1
    cdef vector[int] contacts2
    for i in range(N):
        for j in range(i+1, N):
            curdist = (data[i,0] - data[j,0]) **2 +(data[i,1] - data[j,1]) **2 + (data[i,2] - data[j,2]) **2
            if curdist < cutoff2:
                contacts1.push_back(i)
                contacts2.push_back(j)
    cdef int M = len(contacts1)

    cdef np.ndarray[np.int32_t, ndim = 2] contacts = np.zeros((M,2), dtype = np.int32)
    for i in range(M):
        contacts[i,0] = contacts1[i]
        contacts[i,1] = contacts2[i]
    return contacts

Compilation (or makefile) for Cython code:

    cython --cplus fastContacts.pyx
    g++  -g -march=native -Ofast -fpic -c   fastContacts.cpp -o fastContacts.o `python-config --includes`
    g++  -g -march=native -Ofast -shared  -o fastContacts.so  fastContacts.o `python-config --libs`

Testing code:

from __future__ import print_function, division

import signal
import time
from contextlib import contextmanager

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ckdtree
from scipy.spatial.distance import pdist

from contactmaps import giveContactsOpenMM  # remove this unless you have OpenMM and openmm-polymer libraries installed
from fastContacts import contactsCython


class TimeoutException(Exception): pass


@contextmanager
def time_limit(seconds):
    def signal_handler(signum, frame):
        raise TimeoutException("Timed out!")

    signal.signal(signal.SIGALRM, signal_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)


matplotlib.rcParams.update({'font.size': 8})


def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return np.array(list(pairs))


def condensed_to_pair_indices(n, k):
    x = n - (4. * n ** 2 - 4 * n - 8 * k + 1) ** .5 / 2 - .5
    i = x.astype(int)
    j = k + i * (i + 3 - 2 * n) / 2 + 1
    return np.array([i, j]).T


def close_pairs_pdist(X, max_d):
    d = pdist(X)
    k = (d < max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0], k)


a = np.random.random((100, 3)) * 3  # test set
methods = {"cython": contactsCython, "ckdtree": close_pairs_ckdtree, "OpenMM": giveContactsOpenMM,
           "pdist": close_pairs_pdist}

# checking that each method gives the same value
allUniqueInds = []
for ind, method in methods.items():
    contacts = method(a, 1)
    uniqueInds = contacts[:, 0] + 100 * contacts[:, 1]  # unique index of each contacts
    allUniqueInds.append(np.sort(uniqueInds))  # adding sorted unique conatcts
for j in allUniqueInds:
    assert np.allclose(j, allUniqueInds[0])

# now actually doing testing
repeats = [30,30,30, 30, 30, 20,  20,   10,   5,   3,     2 ,       1,     1,      1]
sizes =    [10,30,100, 200, 300,  500, 1000, 2000, 3000, 10000, 30000, 100000, 300000, 1000000]
systems = [[np.random.random((n, 3)) * ((n / 0.2) ** 0.333333) for k in range(repeat)] for n, repeat in
           zip(sizes, repeats)]

for j, radius in enumerate([0.5, 1, 2, 4, 7, 10]):
    plt.subplot(2, 3, j + 1)
    plt.title("Radius = {0}; {1:.2f} cont per particle".format(radius, 0.2 * (4 / 3 * np.pi * radius ** 3)))

    times = {i: [] for i in methods}

    for name, method in methods.items():
        for n, system, repeat in zip(sizes, systems, repeats):
            if name == "pdist" and n > 30000:
                break  # memory issues
            st = time.time()
            try:
                with time_limit(5 * repeat):
                    for ind in range(repeat):
                        k = len(method(system[ind], radius))
            except:
                print("Run aborted")
                break
            end = time.time()
            mytime = (end - st) / repeat
            times[name].append((n, mytime))
            print("{0} radius={1} n={2} time={3} repeat={4} contPerParticle={5}".format(name, radius, n, mytime,repeat, 2 * k / n))

    for name in sorted(times.keys()):
        plt.plot(*zip(*times[name]), label=name)
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("System size")
    plt.ylabel("Time (seconds)")
    plt.legend(loc=0)

plt.show()

回答2:

Here's how to do it with the cKDTree module. See query_pairs

import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import ckdtree


def close_pairs(X,max_d):
    d = cdist(X,X)

    I,J = (d<max_d).nonzero()
    IJ  = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal element
    IJ  = IJ[:,np.diff(IJ,axis=0).ravel()<>0]

    # remove duplicate
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return np.array(list(pairs))

def test():
    np.random.seed(0)
    X = np.random.rand(100,2)*20
    p = close_pairs(X,2)
    q = close_pairs_ckdtree(X, 2)

    from matplotlib import pyplot as plt
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')
    plt.figure()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[q,0].T,X[q,1].T,'-b')

    plt.show()

t

回答3:

I finally found it myself. The function converting indices k in condensed distance array to equivalent i,j in square distance array is:

def condensed_to_pair_indices(n,k):
    x = n-(4.*n**2-4*n-8*k+1)**.5/2-.5
    i = x.astype(int)
    j = k+i*(i+3-2*n)/2+1
    return i,j

I had to play a little with sympy to find it. Now, to compute all point pairs than are less than a given distance apart:

def close_pairs_pdist(X,max_d):
    d = pdist(X)
    k = (d<max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0],k)

As expected, it is more efficient than the other methods (but I did not test ckdtree). I will update the timeit answer.

回答4:

slightly faster, didnt test the time difference thoroughly, but if i ran it a few times, it gave a time of about 0.0755529403687 for my method, and 0.0928771495819 for yours. I use the triu method to get rid of upper triangle of the array (where duplicates are) including diagonal (which is where the self-distances are), and i dont sort either, since if you plot it, it does not matter if i plot them in order or not. So i guess it speeds up about 15% or so

import numpy as np
from scipy.spatial.distance import cdist
from scipy.misc import comb

def close_pairs(X,max_d):
    d = cdist(X,X)
    I,J = (d<max_d).nonzero()
    IJ  = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal element
    IJ  = IJ[:,np.diff(IJ,axis=0).ravel()<>0]

    # remove duplicate
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs


def close_pairs1(X,max_d):
    d = cdist(X,X)
    d1 = np.triu_indices(len(X)) # indices of the upper triangle including the diagonal
    d[d1] = max_d+1 # value that will not get selected when doing d<max_d in the next line
    I,J = (d<max_d).nonzero()
    pairs = np.vstack((I,J)).T
    return pairs

def close_pairs3(X, max_d):
    d = pdist(X)
    n = len(X)
    pairs = np.zeros((0,2))
    for i in range(n):
        for j in range(i+1,n):
            # formula from http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html
            a=d[int(comb(n,2)-comb(n-i,2)+j-i-1+0.1)] # the +0.1 is because otherwise i get floating point trouble
            if(a<max_d):
                pairs = np.r_[pairs, np.array([i,j])[None,:]]
    return pairs

def close_pairs4(X, max_d):
    d = pdist(X)
    n = len(X)
    a = np.where(d<max_d)[0]
    i = np.arange(n)[:,None]
    j = np.arange(n)[None,:]
    b = np.array(comb(n,2)-comb(n-i,2)+j-i-1+0.1, dtype=int)
    d1 = np.tril_indices(n)
    b[d1] = -1

    pairs = np.zeros((0,2), dtype=int)

    # next part is the bottleneck: the np.where each time, 
    for v in a:
        i, j = np.where(v==b) 
        pairs = np.r_[pairs, np.array([i[0],j[0]])[None,:]]
    return pairs

def close_pairs5(X, max_d):
    t0=time.time()
    d = pdist(X)
    n = len(X)
    a = np.where(d<max_d)[0]
    i = np.arange(n)[:,None]
    j = np.arange(n)[None,:]
    t1 = time.time()
    b = np.array(comb(n,2)-comb(n-i,2)+j-i-1+0.1, dtype=int)
    d1 = np.tril_indices(n)
    b[d1] = -1
    t2 = time.time()
    V = b[:,:,None]-a[None,None,:] # takes a little time
    t3 = time.time()
    p = np.where(V==0) # takes most of the time, thought that removing the for-loop from the previous method might improve it, but it does not do that much. This method contains the formula you wanted though, but apparently it is still faster if you use the cdist methods
    t4 = time.time()
    pairs = np.vstack((p[0],p[1])).T
    print t4-t3,t3-t2, t2-t1, t1-t0
    return pairs





def test():
    X = np.random.rand(1000,2)*20
    import time
    t0 = time.time()
    p = close_pairs(X,2)
    t1 = time.time()
    p2 = close_pairs1(X,2)
    t2 = time.time()
    print t2-t1, t1-t0

    from matplotlib import pyplot as plt
    plt.figure()
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')
    plt.figure()
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p2,0].T,X[p2,1].T,'-b')
    plt.show()



test()

NOTE: plotting laggs if you do it for 1K points, but it needs 1K points to compare speeds, but i checked that it works correctly when plotting it if doing it with 100 points The speed difference is something like ten to twenty percent, and i think it will not get much better than this, since i got rid of all the sorting and unique elements things, so the part that takes most of the time probably is the d = cdist(X, X) line

Edit: some more testing shows that in those times, that cdist line takes up about 0.065 sec, while the rest for your method is about 0.02 and for me about 0.015 sec or so. Conclusion: the main bottleneck of your code is the d = cdist(X, X) line, and the stuff i changed speeds up the rest of the code you got, but the main bottleneck stays

Edit: added the method close_pairs3, which gives you the formula, but speed blows, (still need to figure out how to invert that formula, and than it will be superfast, will do that tomorrow - will use np.where(pdist(X)

Edit: added method close_pairs4, which is slightly better than 3, and explains what happens, but is veeery slow, and same with method 5, does not have that for-loop, but is still very slow

回答5:

I made some code to compare the proposed solutions.

Note: I use scipy 0.11 and cannot use the ckdtree solution (only kdtree) which I expect to be slower. Could anyone with scipy v0.12+ run this code?

import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.spatial import ckdtree
from scipy.spatial import kdtree


def close_pairs(X,max_d):
    d = cdist(X,X)

    I,J = (d<max_d).nonzero()
    IJ  = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal element
    IJ  = IJ[:,np.diff(IJ,axis=0).ravel()<>0]

    # remove duplicate
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs


def condensed_to_pair_indices(n,k):
    x = n-(4.*n**2-4*n-8*k+1)**.5/2-.5
    i = x.astype(int)
    j = k+i*(i+3-2*n)/2+1
    return i,j

def close_pairs_pdist(X,max_d):
    d = pdist(X)
    k = (d<max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0],k)


def close_pairs_triu(X,max_d):
    d = cdist(X,X)
    d1 = np.triu_indices(len(X)) # indices of the upper triangle including the diagonal
    d[d1] = max_d+1 # value that will not get selected when doing d<max_d in the next line
    I,J = (d<max_d).nonzero()
    pairs = np.vstack((I,J)).T
    return pairs

def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return pairs       # remove the conversion as it is not required

def close_pairs_kdtree(X, max_d):
    tree  = kdtree.KDTree(X)
    pairs = tree.query_pairs(max_d)
    return pairs       # remove the conversion as it is not required


methods = [close_pairs, close_pairs_pdist, close_pairs_triu, close_pairs_kdtree] #, close_pairs_ckdtree]

def time_test(n=[10,50,100], max_d=[5,10,50], iter_num=100):
    import timeit

    for method in methods:
        print '-- time using ' + method.__name__ + ' ---'
        for ni in n:
            for d in max_d:
                setup = '\n'.join(['import numpy as np','import %s' % __name__,'np.random.seed(0)','X = np.random.rand(%d,2)*100'%ni])
                stmt  = 'close_pairs.%s(X,%f)' % (method.__name__,d)
                time  = timeit.timeit(stmt=stmt, setup=setup, number=iter_num)/iter_num
                print 'n=%3d, max_d=%2d: \t%.2fms' % (ni, d,time*1000)

Output of time_test(iter_num=10,n=[20,100,500],max_d=[1,5,10]) are:

-- time using close_pairs ---
n= 20, max_d= 1:    0.22ms
n= 20, max_d= 5:    0.16ms
n= 20, max_d=10:    0.21ms
n=100, max_d= 1:    0.41ms
n=100, max_d= 5:    0.53ms
n=100, max_d=10:    0.97ms
n=500, max_d= 1:    7.12ms
n=500, max_d= 5:    12.28ms
n=500, max_d=10:    33.41ms
-- time using close_pairs_pdist ---
n= 20, max_d= 1:    0.11ms
n= 20, max_d= 5:    0.10ms
n= 20, max_d=10:    0.11ms
n=100, max_d= 1:    0.19ms
n=100, max_d= 5:    0.19ms
n=100, max_d=10:    0.19ms
n=500, max_d= 1:    2.31ms
n=500, max_d= 5:    2.82ms
n=500, max_d=10:    2.49ms
-- time using close_pairs_triu ---
n= 20, max_d= 1:    0.17ms
n= 20, max_d= 5:    0.16ms
n= 20, max_d=10:    0.16ms
n=100, max_d= 1:    0.83ms
n=100, max_d= 5:    0.80ms
n=100, max_d=10:    0.80ms
n=500, max_d= 1:    23.64ms
n=500, max_d= 5:    22.87ms
n=500, max_d=10:    22.96ms
-- time using close_pairs_kdtree ---
n= 20, max_d= 1:    1.71ms
n= 20, max_d= 5:    1.69ms
n= 20, max_d=10:    1.96ms
n=100, max_d= 1:    34.99ms
n=100, max_d= 5:    35.47ms
n=100, max_d=10:    34.91ms
n=500, max_d= 1:    253.87ms
n=500, max_d= 5:    255.05ms
n=500, max_d=10:    256.66ms

Conclusion:

The overall fastest method is close_pairs_pdist
The initial method is relatively fast, but sensitive to both the number of samples and the percentage of pairs to return
both the close_pairs_triu and close_pairs_kdtree are sensitive to the number of samples but relatively insensitive to the number of outputs.
the close_pairs_triu method is much faster than close_pairs_kdtree

However, the ckdtree method needs to be tested.