Finding local maxima of xy data point graph with n

2019-03-21 16:09发布

问题:

I would like to get most efficient way to find local maxima in huge data point sets containing thousands of values. As input are used two long lists with x and y values.

Consider this simple example:

xval = [-0.15, -0.02, 0.1, 0.22, 0.36, 0.43, 0.58, 0.67, 0.79, 0.86, 0.96 ]

yval = [-0.09, 0.13, -0.01, -0.1, -0.05, 0.2, 0.56, 0.47, 0.35, 0.43, 0.69]

Desired output is list with indexes of peak points, here locMaxId =[1,6,10]. Comparing the closest neighbours is solution, but for 10k values?

回答1:

You can let numpy handle the iteration, i.e. vectorize it:

def local_maxima(xval, yval):
    xval = np.asarray(xval)
    yval = np.asarray(yval)

    sort_idx = np.argsort(xval)
    yval = yval[sort_idx]
    gradient = np.diff(yval)
    maxima = np.diff((gradient > 0).view(np.int8))
    return np.concatenate((([0],) if gradient[0] < 0 else ()) +
                          (np.where(maxima == -1)[0] + 1,) +
                          (([len(yval)-1],) if gradient[-1] > 0 else ()))

EDIT So the code first computes the variation from every point to the nex(gradient). The next step is a little tricky... If you do np.diff((gradient > 0) the resulting boolean array is True where there is a change from growing (> 0) to not growing(<= 0). By making it a signed int of the same size as the boolean array, you can discriminate from transitions from growing to not growing (-1) to the opposite (+1). By taking a .view(np.int8) of a signed integer type of the same dtype size as the boolean array, we avoid copying the data, as would happen if we did the less hacky .astype(int). All that's left is taking care of the first and last points, and concatenating all points together into a single array. One thing I found out today is that if you include an empty list in the tuple you send to np.concatenate, it comes out as an empty array of dtype np.float, and that ends up being the dtype of the result, hence the more complicated concatenation of empty tuples in the above code.

It works:

In [2]: local_maxima(xval, yval)
Out[2]: array([ 1,  6, 10], dtype=int64)

And is reasonably fast:

In [3]: xval = np.random.rand(10000)

In [4]: yval = np.random.rand(10000)

In [5]: local_maxima(xval, yval)
Out[5]: array([   0,    2,    4, ..., 9991, 9995, 9998], dtype=int64)

In [6]: %timeit local_maxima(xval, yval)
1000 loops, best of 3: 1.16 ms per loop

Also, most of the time is converting your data from lists to arrays and sorting them. If your data is already sorted and kept in arrays, you can probably improve performance over the above by a factor of 5x.