Normalizing histogram bins in gnuplot

I'm trying to plot a histogram whose bins are normalized by the number of elements in the bin.

I'm using the following

binwidth=5
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes

to get a basic histogram, but I want the value of each bin to be divided by the size of the bin. How can I go about this in gnuplot, or using external tools if necessary?

标签： gnuplot histogram normalize bins

5条回答

\"骚年 ilove

2楼-- · 2019-03-16 08:45

Here is how I would do, with n=500 random gaussian variates generated from R with the following command:

Rscript -e 'cat(rnorm(500), sep="\\n")' > rnd.dat

I use quite the same idea as yours for defining a normalized histogram, where y is defined as 1/(binwidth * n), except that I use int instead of floor and I didn't recenter at the bin value. In short, this is a quick adaptation from the smooth.dem demo script, and a similar approach is described in Janert's textbook, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with random-points which is available in the demo folder coming with Gnuplot. Note that we need to specify the number of points as Gnuplot as no counting facilities for records in a file.

bw1=0.1
bw2=0.3
n=500
bin(x,width)=width*int(x/width)
set xrange [-3:3]
set yrange [0:1]
tstr(n)=sprintf("Binwidth = %1.1f\n", n) 
set multiplot layout 1,2
set boxwidth bw1
plot 'rnd.dat' using (bin($1,bw1)):(1./(bw1*n)) smooth frequency with boxes t tstr(bw1)
set boxwidth bw2
plot 'rnd.dat' using (bin($1,bw2)):(1./(bw2*n)) smooth frequency with boxes t tstr(bw2)

Here is the result, with two bin width

enter image description here

Besides, this really is a rough approach to histogram and more elaborated solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: using Freedman-Diaconis binning rule should not be too difficult to implement, although you'll need to compute the inter-quartile range.

Here is how R would proceed with the same data set, with default option (Sturges rule, because in this particular case, this won't make a difference) and equally spaced bin like the ones used above.

enter image description here

The R code that was used is given below:

par(mfrow=c(1,2), las=1)
hist(rnd, main="Sturges", xlab="", ylab="", prob=TRUE)
hist(rnd, breaks=seq(-3.5,3.5,by=.1), main="Binwidth = 0.1", 
     xlab="", ylab="", prob=TRUE)

You can even look at how R does its job, by inspecting the values returned when calling hist():

> str(hist(rnd, plot=FALSE))
List of 7
 $ breaks     : num [1:14] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
 $ counts     : int [1:13] 1 1 12 20 49 79 108 87 71 43 ...
 $ intensities: num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ density    : num [1:13] 0.004 0.004 0.048 0.08 0.196 0.316 0.432 0.348 0.284 0.172 ...
 $ mids       : num [1:13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
 $ xname      : chr "rnd"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

All that to say that you can use R results to process your data with Gnuplot if you like (although I would recommend to use R directly :-).

0人赞添加讨论(0) 举报

祖国的老花朵

3楼-- · 2019-03-16 08:50

Another way of counting the number of data points in a file is by using a system command. This proves useful if you are plotting multiple files, and you don't know the number of points beforehand. I used:

countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )
file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)
...

The countpoints functions avoids counting lines that start with '#'. You would then use the already mentioned functions to plot the normalized histogram.

Here's a complete example:

n=100
xmin=-50.
xmax=50.
binwidth=(xmax-xmin)/n

bin(x,width)=width*floor(x/width)+width/2.0
countpoints(file) = system( sprintf("grep -v '^#' %s| wc -l", file) )

file1count = countpoints (file1)
file2count = countpoints (file2)
file3count = countpoints (file3)

plot file1 using (bin(($1),binwidth)):(1.0/(binwidth*file1count)) smooth freq with boxes,\
     file2 using (bin(($1),binwidth)):(1.0/(binwidth*file2count)) smooth freq with boxes,\
     file3 using (bin(($1),binwidth)):(1.0/(binwidth*file3count)) smooth freq with boxes
...

0人赞添加讨论(0) 举报

The star\"

4楼-- · 2019-03-16 08:57

In gnuplot 4.4, functions take on a different property, in that they can execute multiple successive commands, and then return a value (see gnuplot tricks) This means that you can actually calculate the number of points, n, within the gnuplot file without having to know it in advance. This code runs for a file, "out.dat", containing one column: a list of n samples from a normal distribution:

binwidth = 0.1
set boxwidth binwidth
sum = 0

s(x)          = ((sum=sum+1), 0)
bin(x, width) = width*floor(x/width) + binwidth/2.0

plot "out.dat" u ($1):(s($1))
plot "out.dat" u (bin($1, binwidth)):(1.0/(binwidth*sum)) smooth freq w boxes

The first plot statement reads through the datafile and increments sum once for each point, plotting a zero.

The second plot statement actually uses the value of sum to normalise the histogram.

0人赞添加讨论(0) 举报

迷人小祖宗

5楼-- · 2019-03-16 08:58

Simply

plot 'file' using (bin($2, binwidth)):($4/$4) smooth freq with boxes

0人赞添加讨论(0) 举报

够拽才男人

6楼-- · 2019-03-16 09:00

In gnuplot 4.6, you can count the number of points by stats command, which is faster than plot. Actually, you do not need such a trick s(x)=((sum=sum+1),0), but directly count the number by variable STATS_records after running of stats 'out.dat' u 1.

0人赞添加讨论(0) 举报

Normalizing histogram bins in gnuplot

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间