How to bin a matrix

Posted 2019-08-03 16:11

Question:

numpy.histogram(data, bins) is a very fast and efficient way to count how many elements of the array data fall into each bin defined by the array bins. Is there an equivalent function for the following problem? I have a matrix with R rows and C columns, and I want to bin each row of the matrix using the edges given by bins. The result should be another matrix with R rows and as many columns as there are bins.

I tried passing the matrix to numpy.histogram(data, bins), but the matrix is treated as a flat array with R*C elements, so the result is a single array with Nbins elements.
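The flattening described above is easy to reproduce on a tiny input. The following is only an illustrative sketch: the values of data and bins are made up, and the per-row loop is the slow reference result that the answers try to speed up.

```python
import numpy as np

# Made-up example: R = 2 rows, C = 3 columns, 2 bins.
data = np.array([[0.1, 0.2, 0.9],
                 [0.6, 0.7, 0.8]])
bins = np.array([0.0, 0.5, 1.0])

# np.histogram flattens its input, so both rows are pooled together.
counts, _ = np.histogram(data, bins)
print(counts.shape)   # one count per bin, not per row

# Desired result: one histogram per row, shape (R, len(bins) - 1).
per_row = np.stack([np.histogram(row, bins)[0] for row in data])
print(per_row.shape)
```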

Answer 1:

If you're applying this to an array with many rows, this function will give you some speed-up at the cost of some temporary memory.

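import numpy as np

def hist_per_row(data, bins):

    data = np.asarray(data)
    bins = np.asarray(bins)

    # bins must be sorted for searchsorted to work
    assert np.all(bins[:-1] <= bins[1:])
    r, c = data.shape
    # idx is 0 at or below bins[0] and len(bins) above bins[-1]
    idx = bins.searchsorted(data)
    step = len(bins) + 1
    last = step * r
    # shift each row into its own block of `step` slots
    idx += np.arange(0, last, step).reshape((r, 1))
    res = np.bincount(idx.ravel(), minlength=last)
    res = res.reshape((r, step))
    # drop the underflow and overflow columns
    return res[:, 1:-1]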

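The res[:, 1:-1] on the last line keeps the output consistent with numpy.histogram, which returns an array of length len(bins) - 1. You could drop it if you also want to count the values that fall below bins[0] and above bins[-1]. Note that searchsorted defaults to side='left', so a value exactly equal to a bin edge is counted in the bin to its left (and a value equal to bins[0] lands in the underflow column), whereas numpy.histogram uses left-closed bins except for the last one.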
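To see how the offset trick in the function above works, here is a small worked example. It is a sketch on made-up values; the names offset, flat and counts are only illustrative and do not appear in the answer.

```python
import numpy as np

bins = np.array([0.0, 1.0, 2.0])
data = np.array([[0.5, 1.5, 2.5],
                 [-1.0, 0.5, 0.7]])

# Per-element bin index: 0 means underflow, len(bins) means overflow.
idx = bins.searchsorted(data)

# Each row gets its own block of `step` slots in one flat counter.
step = len(bins) + 1
offset = step * np.arange(data.shape[0])[:, None]
flat = (idx + offset).ravel()

# A single bincount call then histograms all rows at once.
counts = np.bincount(flat, minlength=step * data.shape[0]).reshape(-1, step)
print(counts)            # columns: underflow, bin 0, bin 1, overflow
print(counts[:, 1:-1])   # same shape as stacking np.histogram per row
```

Keeping the two outer columns is what makes it possible to drop or keep the out-of-range counts with a single slice.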



Answer 2:

Thank you all for your answers and comments. I finally found a way to speed up the binning. Instead of calling bins.searchsorted(data), I compute the bin index with np.array(data*nbins, dtype=int). With this substitution in the code posted by Bi Rico, it becomes about 3 times faster. Note that this shortcut only works when the bins are equally spaced and start at 0, and when the data lies between 0 and bins[-1], because the data is simply rescaled by bins[-1]. Below is Bi Rico's function with my modification, so that other users can easily reuse it.

import numpy as np

def hist_per_row(data, bins):

    data = np.asarray(data)
    bins = np.asarray(bins)
    assert np.all(bins[:-1] <= bins[1:])
    r, c = data.shape

    # assumes equally spaced bins with bins[0] == 0 and 0 <= data <= bins[-1]
    nbins = len(bins)-1
    data = data/bins[-1]
    idx = np.array(data*nbins, dtype=int)+1

    step = len(bins) + 1
    last = step * r
    idx += np.arange(0, last, step).reshape((r, 1))
    res = np.bincount(idx.ravel(), minlength=last)
    res = res.reshape((r, step))
    return res[:, 1:-1]
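As a rough sanity check of the arithmetic index, the sketch below compares it with searchsorted on made-up data. The bin edges, the random data and the variable names are assumptions chosen for illustration; side="right" is used because the arithmetic version produces left-closed bins.

```python
import numpy as np

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 4.0, 9)              # uniform edges 0, 0.5, ..., 4
data = rng.uniform(0.0, 4.0, size=(5, 200))  # values in [0, 4)

nbins = len(bins) - 1
# Arithmetic bin index: valid only for equally spaced bins with bins[0] == 0.
fast_idx = (data / bins[-1] * nbins).astype(int) + 1
# General-purpose index with the same left-closed convention.
slow_idx = bins.searchsorted(data, side="right")
print(np.array_equal(fast_idx, slow_idx))

# With unevenly spaced edges the two indices no longer agree.
uneven = np.array([0.0, 1.0, 4.0])
x = np.array([1.5])
fast_uneven = (x / uneven[-1] * (len(uneven) - 1)).astype(int) + 1
slow_uneven = uneven.searchsorted(x, side="right")
print(fast_uneven, slow_uneven)
```

The second part shows why the shortcut should not be used with arbitrary bin edges: 1.5 belongs to the second bin, but the rescaling places it in the first.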


Answer 3:

Something along these lines?

import numpy as np
data = np.random.rand(10,20)
bins = np.linspace(0, 1, 11)  # shared edges, so every row uses the same bins
print(np.apply_along_axis(lambda x: np.histogram(x, bins)[0], 1, data))