可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I need to compute the mutual information, and so the shannon entropy of N variables.

I wrote a code that compute shannon entropy of certain distribution. Let's say that I have a variable x, array of numbers. Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.

import scipy.integrate as scint
from numpy import*
from scipy import*

def shannon_entropy(a, bins):

p,binedg= histogram(a,bins,normed=True)
p=p/len(p)

x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.

return scint.simps(g,x=x)

Choosing inserting x, and carefully the bin number this function works.

But this function is very dependent on the bin number: choosing different values of this parameter I got different values.

Particularly if my input is an array of values constant:

x=[0,0,0,....,0,0,0]

the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:

p,binedg= histogram(a,bins,normed=True)
p=p/len(p)

I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.

I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error. The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation

def intNd(c,axes):

assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
    return scint.simps(c,axes[0])
else:
    return intNd(scint.simps(c,axes[-1]),axes[:-1])



def shannon_entropydd(c,bins=30):



hist,ax=histogramdd(c,bins,normed=True)

for i in range(len(ax)):
    ax[i]=ax[i][:-1]

p=-hist*log2(hist)

p[isnan(p)]=0

return intNd(p,ax)

I need these quantities in order to be able to compute the mutual information between certain set of variables:

M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)

where H(x) is the shannon entropy of the variable x

I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!

回答1:

The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.

I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.

[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf

回答2:

I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...

I must say that I did not read your code, but this works for me:

import numpy as np

x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]

def entropy(c):
    c_normalized = c/float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    h = -sum(c_normalized * np.log(c_normalized))  
    return h

hx = entropy(cx)