I'm trying to plot a histogram whose bins are normalized by the number of elements in the bin.
I'm using the following:
binwidth=5
bin(x,width)=width*floor(x/width) + width/2.0
plot 'file' using (bin($2, binwidth)):($4) smooth freq with boxes
to get a basic histogram, but I want the value of each bin to be divided by the size of the bin. How can I go about this in gnuplot, or using external tools if necessary?
Here is how I would do it, with n=500 random Gaussian variates generated in R with the following command:
I use much the same idea as yours to define a normalized histogram, where y is defined as 1/(binwidth * n), except that I use int instead of floor and I didn't recenter at the bin value. In short, this is a quick adaptation of the smooth.dem demo script, and a similar approach is described in Janert's textbook, Gnuplot in Action (Chapter 13, p. 257, freely available). You can replace my sample data file with random-points, which is available in the demo folder that comes with Gnuplot. Note that we need to specify the number of points ourselves, as Gnuplot has no facility for counting the records in a file.
Here is the result, with two bin widths:
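For reference, the normalization described above (each point contributes 1/(binwidth * n), binned with int and no recentering) can be sketched as follows. This is only an illustration: the file name random.dat and the bin width are assumptions, and n has to be supplied by hand.

```gnuplot
# Hypothetical data file: one sample per line in column 1.
datafile = 'random.dat'
n = 500                       # number of records, supplied by hand
binwidth = 0.3                # illustrative bin width (an assumption)

# int() truncates toward zero; the bin value is not recentered.
bin(x, w) = w * int(x / w)

set boxwidth binwidth
set style fill solid 0.4

# Each point adds 1/(binwidth*n) to its bin, so the box areas sum to 1.
plot datafile using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes notitle
```

Because int() truncates toward zero, values on either side of zero fall into the same bin; using floor(), as in the question, avoids that.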
Besides, this really is a rough approach to histograms, and more elaborate solutions are readily available in R. Indeed, the problem is how to define a good bin width, and this issue has already been discussed on stats.stackexchange.com: the Freedman-Diaconis binning rule should not be too difficult to implement, although you'll need to compute the interquartile range.
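As a rough sketch of the Freedman-Diaconis rule, binwidth = 2 * IQR / n^(1/3), the quartiles can be taken from the stats command. This assumes gnuplot 4.6 or later and a hypothetical file random.dat with the samples in column 1; gnuplot's quartile estimates may differ slightly from those R computes.

```gnuplot
# Hypothetical data file: one sample per line in column 1.
datafile = 'random.dat'

# stats fills the STATS_* variables (gnuplot 4.6 or later).
stats datafile using 1 nooutput

iqr = STATS_up_quartile - STATS_lo_quartile   # interquartile range
n   = STATS_records                           # number of data points

# Freedman-Diaconis rule: 2 * IQR / n^(1/3)
binwidth = 2.0 * iqr / n**(1.0/3.0)

bin(x, w) = w * floor(x / w) + w / 2.0
set boxwidth binwidth
plot datafile using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes notitle
```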
Here is how R would proceed with the same data set, with default options (Sturges' rule, because in this particular case it won't make a difference) and equally spaced bins like the ones used above.
The R code that was used is given below:
You can even look at how R does its job by inspecting the values returned when calling hist():
All that to say that you can use R results to process your data with Gnuplot if you like (although I would recommend using R directly :-).
Another way of counting the number of data points in a file is by using a system command. This proves useful if you are plotting multiple files, and you don't know the number of points beforehand. I used:
The countpoints function avoids counting lines that start with '#'. You would then use the already mentioned functions to plot the normalized histogram.
Here's a complete example:
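A minimal sketch of such a script is given below. It is not the original listing: it assumes a Unix-like shell with grep, a hypothetical data file random.dat, and an illustrative bin width.

```gnuplot
# Count data lines, skipping comment lines that start with '#'.
# grep -vc prints the number of non-matching lines; "+ 0" turns the
# string returned by system() into a number.
countpoints(file) = system(sprintf("grep -vc '^#' '%s'", file)) + 0

datafile = 'random.dat'       # hypothetical file name
n = countpoints(datafile)
binwidth = 0.3                # illustrative bin width (an assumption)

bin(x, w) = w * floor(x / w) + w / 2.0
set boxwidth binwidth
plot datafile using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes notitle
```

Note that blank lines in the file are still counted by this grep call, so the count is only exact for files without empty lines.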
In gnuplot 4.4, functions take on a different property: they can execute multiple successive commands and then return a value (see gnuplot tricks). This means that you can actually calculate the number of points, n, within the gnuplot file without having to know it in advance. This code runs for a file, "out.dat", containing one column: a list of n samples from a normal distribution:
The first plot statement reads through the datafile and increments sum once for each point, plotting a zero.
The second plot statement actually uses the value of sum to normalise the histogram.
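The two plot statements described above might look like the following sketch. The bin width is an assumed value, and s is simply a name chosen here for the counting function.

```gnuplot
binwidth = 0.1                # illustrative bin width (an assumption)
set boxwidth binwidth

sum = 0
# Comma operator: increment sum, then return 0 as the value to plot.
s(x) = ((sum = sum + 1), 0)

bin(x, width) = width * floor(x / width) + width / 2.0

# First pass: s() is evaluated once per point, so sum ends up equal to n.
plot "out.dat" using 1:(s($1))

# Second pass: normalize each contribution by binwidth * sum.
plot "out.dat" using (bin($1, binwidth)):(1.0/(binwidth*sum)) smooth freq with boxes
```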
Simply
In gnuplot 4.6, you can count the number of points with the stats command, which is faster than plot. In fact, you do not need a trick such as s(x)=((sum=sum+1),0); you can read the count directly from the variable STATS_records after running stats 'out.dat' u 1.
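Put together, a sketch of the stats-based version could look like this (gnuplot 4.6 or later; the bin width is an assumed value):

```gnuplot
# stats sets STATS_records to the number of valid data points.
stats 'out.dat' using 1 nooutput
n = STATS_records

binwidth = 0.1                # illustrative bin width (an assumption)
bin(x, width) = width * floor(x / width) + width / 2.0

set boxwidth binwidth
plot 'out.dat' using (bin($1, binwidth)):(1.0/(binwidth*n)) smooth freq with boxes
```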