Question:
I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins.
This will naturally come at the expense of the bin widths, which can, and in general will, be different.
I will specify the number of desired bins and the data set, obtaining the bin edges in return.
Example:
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
There are two limitations:
- if a group of data points is identical, the bin containing them could be bigger.
- if there are N data points and M bins are requested, each bin will hold N//M points, plus one extra bin for the remainder if N%M is not 0.
This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    distr_size = len(in_distr)

    # Integer division: points per full bin, and the number of leftover points
    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num

    # Indices that would sort the data
    args = in_distr.argsort()

    # One row per bin, each holding bin_size sorted values
    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        # Leftover points that do not fill a whole bin
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
Answer 1:
Using your example case (bins of 2 points, 6 total data points):
from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
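The quantile list can also be generated for any number of bins instead of being typed by hand. A minimal sketch, assuming NumPy's np.quantile as a stand-in for mquantiles (it interpolates linearly, so the inner edges come out slightly different from the values above); quantile_edges is a hypothetical helper name, not part of either library:

```python
import numpy as np

def quantile_edges(data, nbins):
    """Hypothetical helper: bin edges at nbins + 1 equally spaced quantiles."""
    probs = np.linspace(0.0, 1.0, nbins + 1)
    return np.quantile(data, probs)

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = quantile_edges(data, 3)

# Count how many points fall between consecutive edges
counts, _ = np.histogram(data, bins=edges)
print(edges)
print(counts)
```

Feeding the edges back into np.histogram, as done here, is a quick way to check that the bins really are equally populated for a given data set.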
Answer 2:
I would also like to mention pandas.qcut, which does equi-populated binning quite efficiently. In your case it would work something like this:
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)
# bin definition
bins = qc.categories
print(bins)
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')
# bin corresponding to each point in data
codes = qc.codes
print(codes)
>> array([0, 0, 1, 1, 2, 2], dtype=int8)
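Since the question asks for the bin edges themselves, it may be simpler to have qcut return them directly. A small sketch, assuming the retbins keyword of pd.qcut and the same data as above; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])

# retbins=True makes qcut return the bin edges alongside the categorical
qc, edges = pd.qcut(data, q=3, retbins=True)

# Number of points assigned to each bin
counts = np.bincount(qc.codes, minlength=3)
print(edges)
print(counts)
```

For data with many repeated values, qcut also accepts duplicates='drop', which discards repeated edges instead of raising an error.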
Answer 3:
Update for skewed distributions:
I came across the same problem as @astabada, wanting to create bins that each contain an equal number of samples. When applying the solution proposed by @aganders3, I found that it didn't work particularly well for skewed distributions. In the case of skewed data (for example, something with a whole lot of zeros), stats.mstats.mquantiles for a predefined number of quantiles will not guarantee an equal number of samples in each bin. You will get bin edges that look like this:
[0. 0. 4. 9.]
In which case the first bin will be empty.
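The effect is easy to reproduce. A minimal sketch, using np.quantile as a stand-in for mquantiles and a made-up sample dominated by zeros (both are assumptions for illustration, not the data that produced the edges above):

```python
import numpy as np

# Made-up skewed sample: eight zeros and two larger values
data = np.array([0.] * 8 + [4., 9.])

# Edges at four equally spaced quantiles, i.e. three requested bins
edges = np.quantile(data, np.linspace(0.0, 1.0, 4))

# Repeated edges mean that at least one bin has zero width
has_repeated_edges = np.unique(edges).size < edges.size

counts, _ = np.histogram(data, bins=edges)
print(edges)
print(has_repeated_edges)
print(counts)
```

Checking for repeated edges, as in has_repeated_edges, is the same test the function below uses to decide whether to reduce the number of quantiles.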
To deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically modifies the number of bins if the sample counts are not equal within a certain tolerance (30% of the smallest sample size in the example code). If the counts differ between bins, the code reduces the number of equally spaced quantiles by 1 and calls stats.mstats.mquantiles again, until the counts are equal or only one bin exists.
I hard-coded the tolerance in the example, but this could be turned into a keyword argument if desired.
I also prefer giving the number of equally spaced quantiles as an argument to my function instead of giving user-defined quantiles to stats.mstats.mquantiles, in order to reduce accidental errors (i.e. something like [0., 0.25, 0.7, 1.]).
Here's the code:
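import numpy as np
from scipy import stats

def equibins(dat, binnum, **kwargs):
    # Fall back to the mquantiles defaults (0.4) when alpha/beta are not given
    alpha = kwargs.get('alpha', 0.4)
    beta = kwargs.get('beta', 0.4)
    numin = binnum
    while numin > 1:
        # numin equally spaced quantiles in [0, 1); 1.0 itself is not included
        qtls = np.linspace(0., 1.0, num=numin, endpoint=False)
        ebins = stats.mstats.mquantiles(dat, qtls, alphap=alpha, betap=beta)
        allhist, allbin = np.histogram(dat, bins=ebins)
        # Retry with one quantile fewer if edges repeat or counts are too uneven
        if (np.unique(ebins).shape != ebins.shape or not tolerence(allhist, 0.3)) and numin > 2:
            numin = numin - 1
            del qtls, ebins
        else:
            numin = 0
    return ebins

def tolerence(narray, percent):
    # Accept percent either as a fraction (0.3) or as a percentage (30)
    if percent > 1.0:
        per = percent / 100.
    else:
        per = percent
    lev_tol = per * narray.min()
    tolerate = np.all(narray[1:] - narray[0] < lev_tol)
    return tolerate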
Answer 4:
Just sort the data, and divide it into fixed bins by length! Obviously you can never divide into exactly equally populated bins, if the number of samples does not divide exactly by the number of bins.
import math
import numpy as np
data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
data_sorted = np.sort(data)
nbins = 3
# Ceiling division, so that at most nbins chunks are produced
step = math.ceil(len(data_sorted) / nbins)
binned_data = []
for i in range(0, len(data_sorted), step):
    binned_data.append(data_sorted[i:i+step])
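If only the bin edges are needed, the full sort may be avoidable, which could matter for very large inputs. A minimal sketch, assuming np.partition with a list of ranks; equal_count_edges is a hypothetical name, and the edge convention (first value of each chunk, plus the maximum) follows the example in the question rather than the chunking above:

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Hypothetical helper: edges of bins holding len(data) // nbins points each."""
    data = np.asarray(data)
    n = data.size
    size = n // nbins
    if size == 0:
        raise ValueError("need at least as many points as bins")
    # Sorted positions (ranks) of each bin's first point, plus the last point
    ranks = np.unique(np.append(np.arange(0, n, size), n - 1))
    # np.partition puts the values at these ranks where a full sort would put
    # them, without fully ordering the values in between
    part = np.partition(data, ranks)
    return part[ranks]

data = np.array([2.1, 1., 2.12, 1.3, 1.2, 2.0])  # unsorted on purpose
edges = equal_count_edges(data, 3)
print(edges)
```

When len(data) is not a multiple of nbins, the remainder is absorbed by the last bins, so the number of bins and their counts may differ slightly from what was requested.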