I am generating many, many contingency tables as part of a project that I'm writing.
The workflow is:
- Take a large data array with continuous (float) rows and convert those to discrete integer values by binning, so that the resulting row has values 0-9, for example (a rough sketch of this step follows the list)
- Slice two rows into vectors X & Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution
- For example, I'd have a 10 x 10 array counting the number of times each (xi, yi) pair occurs
- Use the contingency table to do some information theory math
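For concreteness, the binning step looks roughly like this. It's only a sketch with made-up data and equal-width bins on [0, 1); the real rows and bin edges are different:

import numpy as np

# Hypothetical example: bin one continuous (float) row into integer labels 0-9
row = np.random.rand(5000)              # stand-in for one continuous row
edges = np.linspace(0.0, 1.0, 11)       # 10 equal-width bins over [0, 1)
binned = np.digitize(row, edges[1:-1])  # integer labels 0..9

Two rows binned like this become the x and y that get passed to make_table below.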
Initially, I wrote this as:
def make_table(x, y, num_bins):
    # Walk the two vectors in Python, bumping one cell per (x, y) pair
    ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable
This works fine, but it's so slow that it accounts for roughly 90% of the runtime of the entire project.
The fastest Python-only optimization I've been able to come up with is this:
def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
    # Flatten each (x, y) pair into a single index: x * num_bins + y
    reindex = np.dot(np.stack((x, y)).transpose(),
                     np.array([num_bins, 1]))
    # Count how many times each flattened index occurs, then fill the table
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))
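To spell out what reindex is doing, here's a toy example of the same flattening with num_bins = 3 and made-up values (assuming numpy is imported as np):

x = np.array([0, 2, 2])
y = np.array([1, 1, 1])
flat = x * 3 + y                                  # array([1, 7, 7])
idx, count = np.unique(flat, return_counts=True)  # idx = [1, 7], count = [1, 2]
# reshape((3, 3)) later puts the count for index 7 back at row 2, column 1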
That version is (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in Cython?
Also, here's the benchmarking function I'm using:
import time

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    print("Func {na}: {ti:.3f} s".format(na=func.__name__, ti=end - start))