I am generating many, many contingency tables as part of a project that I'm writing.
The workflow is:
- Take a large data array with continuous (float) rows and convert those to discrete integer values by binning (so that the resulting row has values 0-9, for example); a minimal sketch of this step follows the list
- Slice two rows into vectors X & Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution
- For example, I'd have a 10 x 10 array counting the number of (xi, yi) pairs that occur
- Use the contingency table to do some information theory math
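For concreteness, here is a minimal sketch of the binning step (purely illustrative; the row data and the [0, 1) range are stand-ins, not my actual data):

import numpy as np

num_bins = 10
row = np.random.rand(5000)                          # stand-in for one continuous data row
edges = np.linspace(0.0, 1.0, num_bins + 1)         # bin edges over the assumed [0, 1) range
binned = np.clip(np.digitize(row, edges) - 1, 0, num_bins - 1)  # floats -> integers 0..9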
Initially, I wrote this as:
def make_table(x, y, num_bins):
    ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable
This works fine, but is so slow that it's eating up like 90% of the runtime of the entire project.
The fastest python-only optimization I've been able to come up with is this:
def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
    # encode each (x_i, y_i) pair as a single index into the flattened table
    reindex = np.dot(np.stack((x, y)).transpose(),
                     np.array([num_bins, 1]))
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))
That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in cython?
Also, here's a benchmarking function.
import time

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    print("Func {na}: {ti} s".format(na=func.__name__, ti=(end - start)))
The clever trick for representing the elements of np.stack((x, y)) as integers can be made faster:
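For example, a minimal sketch of a faster encoding (reusing x, y and num_bins from the question, which already hold non-negative integers below num_bins):

reindex = num_bins * x + y  # same flattened index, computed with plain integer arithmetic instead of a dot product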
Moreover, the last part of your second solution can be simplified somewhat, by simply considering:
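Perhaps something along these lines (a sketch only, reusing idx and count from the np.unique call above):

ctable = np.zeros(num_bins ** 2, dtype=int)
ctable[idx] = count                       # fancy indexing fills all observed cells at once, no Python loop
ctable = ctable.reshape((num_bins, num_bins))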
What is more, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to:
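A sketch of that combined version (assuming the same inputs as make_table in the question):

def make_table(x, y, num_bins):
    # np.bincount counts every encoded pair directly;
    # minlength guarantees all num_bins ** 2 cells are present even if some pairs never occur
    return np.bincount(num_bins * x + y,
                       minlength=num_bins ** 2).reshape(num_bins, num_bins)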
All in all, this provides quite some performance improvement over what you are currently doing.
You may also want to be aware of np.histogramdd, which takes care of both rounding and binning at once, although it's likely going to be slower than rounding and using np.bincount.
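For completeness, a rough sketch of that route (x_float and y_float stand in for the unbinned continuous rows, and the range is assumed to be known in advance):

counts, _ = np.histogramdd(np.stack((x_float, y_float), axis=-1),
                           bins=num_bins,
                           range=[(0.0, 1.0), (0.0, 1.0)])  # range assumed; adjust to the real data bounds
# counts is a (num_bins, num_bins) array of pair frequencies (float dtype)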