I have a huge number of entries, each one a float. These data, x, are accessible through an iterator. I need to classify all the entries using selections like 10<y<=20, 20<y<=50, ..., where y are data from another iterable. The number of entries is much larger than the number of selections. At the end I want a dictionary like:
{ 0: [all events with 10<x<=20],
1: [all events with 20<x<=50], ... }
or something similar. For example, I'm currently doing:
for x, y in itertools.izip(variable_values, binning_values):
    thebin = binner_function(y)
    self.data[tuple(thebin)].append(x)
In general, y is multidimensional.
This is very slow. Is there a faster solution, for example with numpy? I think the problem comes from the list.append method I'm using and not from binner_function.
A fast way to get the assignments in numpy is np.digitize:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html
You'd still have to split the resulting assignments up into groups. If x or y is multidimensional, you will have to flatten the arrays first. You could then get the unique bin assignments and iterate over those in conjunction with np.where to split the assignments up into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.
As a somewhat trivial example that you will need to tweak/elaborate on for your particular problem (but is hopefully enough to get you started with a numpy solution):
In [1]: import numpy as np
In [2]: x = np.random.normal(size=(50,))
In [3]: b = np.linspace(-20,20,50)
In [4]: assign = np.digitize(x,b)
In [5]: assign
Out[5]:
array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])
In [6]: uid = np.unique(assign)
In [7]: adict = {}
In [8]: for ii in uid:
   ...:     adict[ii] = np.where(assign == ii)[0]
   ...:
In [9]: adict
Out[9]:
{23: array([ 0, 8, 11, 25, 36, 37]),
24: array([ 4, 6, 9, 24, 29, 30, 33, 35, 40, 49]),
25: array([ 1, 2, 3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
26: array([ 5, 7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
27: array([14, 32, 43, 46])}
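The example above returns the indices that fall in each bin. If you want the x values themselves, grouped by a selection on y with bins closed on the right as in the question, one possible variation is to sort by bin once and cut the sorted array at the bin boundaries. This is only a sketch: the data and the edges array are made up for illustration, and bin 0 (y <= 10) and bin 3 (y > 50) would appear as extra keys if any values fell there.

```python
import numpy as np

# Made-up data: x holds the values to collect, y the values to bin on.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([15.0, 45.0, 20.0, 10.5, 50.0, 30.0])

# Edges for the selections 10<y<=20 and 20<y<=50 (illustrative).
edges = np.array([10.0, 20.0, 50.0])

# right=True closes each bin on the right: edges[i-1] < y <= edges[i].
assign = np.digitize(y, edges, right=True)

# Sort once by bin number, then cut the sorted x at the first
# occurrence of each new bin number.
order = np.argsort(assign, kind="stable")
bin_ids, starts = np.unique(assign[order], return_index=True)
groups = dict(zip(bin_ids.tolist(), np.split(x[order], starts[1:])))
```

This avoids one np.where pass per bin, which may matter when the number of occupied bins is large.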
For dealing with flattening and then unflattening numpy arrays, see:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html
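For a multidimensional y, a rough sketch of how those two functions could fit in: digitize each dimension separately, collapse the per-dimension bin indices into one flat bin number with np.ravel_multi_index, and use np.unravel_index to get the bin tuple back when building the dictionary keys. The arrays and edges here are made up, and the sketch assumes every y value lies inside the edges (np.ravel_multi_index raises on out-of-range indices by default).

```python
import numpy as np

# Made-up 2-D binning variable: one row per entry, one column per dimension.
y = np.array([[0.5, 1.5],
              [2.5, 0.2],
              [0.7, 1.9],
              [2.1, 0.9]])
x = np.array([10.0, 20.0, 30.0, 40.0])

# Illustrative edges per dimension; all y values are assumed to lie inside them.
edges = [np.array([0.0, 1.0, 2.0, 3.0]),   # 3 bins along dimension 0
         np.array([0.0, 1.0, 2.0])]        # 2 bins along dimension 1
shape = tuple(len(e) - 1 for e in edges)

# Per-dimension bin index, shifted so the first bin is 0.
idx = tuple(np.digitize(y[:, d], e) - 1 for d, e in enumerate(edges))

# Collapse each index tuple into a single flat bin number.
flat = np.ravel_multi_index(idx, shape)

# Group x by flat bin, keyed by the multidimensional bin tuple.
groups = {}
for f in np.unique(flat):
    key = tuple(int(i) for i in np.unravel_index(f, shape))
    groups[key] = x[flat == f]
```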
np.searchsorted is your friend. As I read in another answer on the same topic, it is currently a good bit faster than np.digitize and does the same job.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html
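One thing to watch when swapping one for the other: the argument order is reversed, and for increasing bin edges the default np.digitize corresponds to side='right', not the searchsorted default of side='left'. A quick check on made-up data (the seed and sizes are arbitrary, and this says nothing about relative speed on your machine):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative data only
x = rng.normal(size=1000)
b = np.linspace(-20, 20, 50)     # increasing bin edges

# np.digitize(x, b) uses b[i-1] <= x < b[i], which matches side='right'.
same_default = np.array_equal(
    np.digitize(x, b), np.searchsorted(b, x, side="right"))

# np.digitize(x, b, right=True) uses b[i-1] < x <= b[i], which matches side='left'.
same_right = np.array_equal(
    np.digitize(x, b, right=True), np.searchsorted(b, x, side="left"))
```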