For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?
By mutual information, I mean:
I(X, Y) = H(X) + H(Y) - H(X,Y)
where H(X) refers to the Shannon entropy of X.
Currently I'm using np.histogram2d
and np.histogram
to calculate the joint (X,Y) and individual (X or Y) counts. For a given matrix A
(e.g. a 250000 X 1000 matrix of floats), I am doing a nested for
loop,
n = A.shape[1]
for ix = arange(n)
for jx = arange(ix+1,n):
matMI[ix,jx]= calc_MI(A[:,ix],A[:,jx])
Surely there must be better/faster ways to do this?
As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
Here is my full implementation, following the conventions in the Wiki page:
import numpy as np
def calc_MI(X,Y,bins):
c_XY = np.histogram2d(X,Y,bins)[0]
c_X = np.histogram(X,bins)[0]
c_Y = np.histogram(Y,bins)[0]
H_X = shan_entropy(c_X)
H_Y = shan_entropy(c_Y)
H_XY = shan_entropy(c_XY)
MI = H_X + H_Y - H_XY
return MI
def shan_entropy(c):
c_normalized = c / float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
H = -sum(c_normalized* np.log2(c_normalized))
return H
A = np.array([[ 2.0, 140.0, 128.23, -150.5, -5.4 ],
[ 2.4, 153.11, 130.34, -130.1, -9.5 ],
[ 1.2, 156.9, 120.11, -110.45,-1.12 ]])
bins = 5 # ?
n = A.shape[1]
matMI = np.zeros((n, n))
for ix in np.arange(n):
for jx in np.arange(ix+1,n):
matMI[ix,jx] = calc_MI(A[:,ix], A[:,jx], bins)
Although my working version with nested for
loops does it at reasonable speed, I'd like to know if there is a more optimal way to apply calc_MI
on all the columns of A
(to calculate their pairwise mutual information)?
I'd also like to know:
Whether there are efficient ways to map functions to operate on columns (or rows) of
np.arrays
(maybe likenp.vectorize
, which looks more like a decorator)?Whether there are other optimal implementations for this specific calculation (mutual information)?