What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to bin them into N bins by their range. Right now, I do something like this:
from numpy import linspace
num_bins = 3 # number of bins to use
values = ... # some array of integers
min_val = min(values) - 1
max_val = max(values) + 1
my_bins = linspace(min_val, max_val, num_bins)
# assign each point to the closest value in my_bins
for v in values:
    best_bin = min_index(abs(my_bins - v))
where min_index returns the index of the minimum value. The idea is that you can find the bin the point falls into by seeing what bin it has the smallest difference with.
But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally ones that are half closed half open (so that there is no way of assigning one point to two bins), i.e.
bin1 = [x1, x2)
bin2 = [x2, x3)
bin3 = [x3, x4)
etc...
What is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.
Thanks very much for your help.
numpy.histogram() does exactly what you want. The function signature is:
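As a point of reference, recent NumPy releases document the signature as numpy.histogram(a, bins=10, range=None, density=None, weights=None). Older releases differ slightly, so treat this as an assumption about the installed version. A minimal call, with made-up sample values, might look like this:

```python
import numpy as np

values = np.array([1, 2, 2, 5, 7, 9])  # hypothetical sample data

# bins given as a count: NumPy builds equal-width edges over [min, max]
counts, edges = np.histogram(values, bins=3)
print(counts)  # how many values fell into each bin
print(edges)   # the bin edges; there is one more edge than there are bins
```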
We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be a number of bins (your num_bins), or it can be a sequence of scalars, which denote bin edges (half open). To quote the documentation:
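The half-open behaviour can be checked with a small sketch (the data and edges here are made up for illustration). Each bin includes its left edge and excludes its right edge, except the last bin, which NumPy closes on both sides:

```python
import numpy as np

edges = [1, 2, 3, 4]             # bins: [1, 2), [2, 3), [3, 4]
values = np.array([1, 2, 3, 4])  # one value sitting on every edge

counts, _ = np.histogram(values, bins=edges)
print(counts)  # 1 lands in the first bin, 2 in the second, 3 and 4 in the last
```

No value is counted twice, which is the property asked for in the question. The one asymmetry is the closed last bin.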
Edit: You want to know the index in your bins of each element. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well. Since the interval is open on the upper limit, the indices are correct:
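A minimal sketch of this approach, using made-up integer data and edges. With its default right=False, numpy.digitize returns i such that edges[i-1] <= x < edges[i], so subtracting 1 gives 0-based bin indices. numpy.bincount then tallies how many values landed in each bin:

```python
import numpy as np

values = np.array([0, 3, 4, 7, 9])  # hypothetical integer data
edges = np.array([0, 4, 8, 12])     # bins: [0, 4), [4, 8), [8, 12)

# 0-based index of the half-open bin each value falls into
bin_index = np.digitize(values, edges) - 1

# number of values per bin; minlength keeps trailing empty bins in the result
per_bin = np.bincount(bin_index, minlength=len(edges) - 1)

print(bin_index)
print(per_bin)
```

The value 4 sits exactly on an edge and is assigned only to the second bin, [4, 8).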
This is fairly straightforward in numpy using broadcasting. My example below is four lines of code (not counting the first two lines, which create the bins and data points; those would of course ordinarily be supplied).
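One way such a broadcasting approach could look is sketched here. The random data, the edge values, and every name other than 'data' and 'bin_assignments' are assumptions for illustration, not the answer's original code:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 50, size=30)  # 30 integer points in [0, 50)
edges = np.linspace(0, 50, 6)        # 6 edges -> 5 bins of width 10

# Broadcast a (30, 1) column of points against (1, 5) rows of edges.
# in_bin[i, j] is True when edges[j] <= data[i] < edges[j + 1].
left = data[:, None] >= edges[None, :-1]
right = data[:, None] < edges[None, 1:]
in_bin = left & right

# Each row holds exactly one True; argmax returns its column index.
bin_assignments = in_bin.argmax(axis=1)
print(bin_assignments)
```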
'bin_assignments' is a 1-D array of integer indices from 0 to 4, one for each of the five bins. It holds the bin assignment for each of the 30 original points in the 'data' array above.