I am using ggplot2 2.1.0 to plot histograms, and I am seeing unexpected behaviour in how values are assigned to the histogram bins. Here is an example with left-closed bins (i.e. [0, 0.1[) and a binwidth of 0.1.
library(ggplot2)
mydf <- data.frame(myvar = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y = ..count..), binwidth = 0.1, boundary = 0.1, closed = "left")
myplot
ggplot_build(myplot)$data[[1]]
In this example, one would expect the value -0.4 to fall in the bin [-0.4, -0.3[, but it (mysteriously) falls in the bin [-0.5, -0.4[ instead. The same happens for the value -0.1, which falls in [-0.2, -0.1[ instead of [-0.1, 0[, and so on.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance, Best regards, Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651
Edit: The problem described below was fixed in a recent release of ggplot2.

Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.

PROBLEM:
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to tweaking too. (I have made the binwidth narrower for better visuals.)
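A minimal sketch of this workaround, reusing the data from the question. The binwidth of 0.05 is an assumed value (the text only says "narrower"), and the default count statistic is used rather than aes(y = ..count..). It shows one way to check the bins numerically, not necessarily the exact call used for the original plot.

```r
library(ggplot2)

# Data from the question.
mydf <- data.frame(myvar = c(-1, -0.5, -0.4, -0.1, -0.1, 0.05, 0.1, 0.1, 0.25, 0.5, 1))

# Shift the bin boundary off the data values: with boundary = 0.99 and a
# binwidth of 0.05 (assumed), the breaks sit at ..., 0.94, 0.99, 1.04, ...,
# so no observation lies exactly on a break.
p <- ggplot(mydf, aes(myvar)) +
  geom_histogram(binwidth = 0.05, boundary = 0.99, closed = "left")

# Inspect the computed bins instead of relying on the picture alone.
bins <- ggplot_build(p)$data[[1]][, c("xmin", "xmax", "count")]
print(bins[bins$count > 0, ])
```

Because no data value coincides with a break, the bin each value falls into no longer depends on how the break was rounded.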
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus slightly less than the machine zero (see eps below). In ggplot2 the fuzziness multiplies by 1e-7 (earlier versions) or 1e-8 (later versions).

CAUSE:
The problem appears clearly in ncount:

ROUNDING ERRORS?
Looks like:

(I have removed the boundary option altogether.)

This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computation of count leads to the function bin_vector(), which contains the following lines:

By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
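To make the rounding-error hypothesis concrete, here is a base-R sketch. It assumes that breaks are computed as a start value plus multiples of the binwidth and that values are then compared against those breaks. It is an illustration only, not the ggplot2 source, and the names breaks, fuzz, bin_exact and bin_fuzzy are made up for this example.

```r
# Breaks built as a start value plus multiples of the binwidth.
breaks <- -1 + (0:20) * 0.1

# The break that "should" equal -0.4 is slightly larger in floating point.
print(format(breaks[7], digits = 17))
print(breaks[7] > -0.4)

x <- c(-0.4, -0.1)

# Left-closed bins [b_i, b_{i+1}): index of the bin each value falls into.
bin_exact <- findInterval(x, breaks)

# Same binning after nudging every break down by a small fuzz
# (1e-8 times the binwidth; the factor is the one quoted in the text).
fuzz <- 1e-8 * 0.1
bin_fuzzy <- findInterval(x, breaks - fuzz)

print(data.frame(x, bin_exact, bin_fuzzy))
```

With IEEE-754 doubles, -0.4 and -0.1 land in bins 6 and 9 without the fuzz (one bin too low, as in the question) and in bins 7 and 10 with it.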
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:

bins$fuzzy correctly stores the fuzzy parameters.

The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.

If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. This is not proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.

At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.

PATCHING
To "patch" the bin_vector function, copy the function definition from the GitHub source or, more conveniently, from the terminal, with:

Modify it (patch it) and assign it into the namespace:
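A minimal sketch of the mechanism only. It assumes that the installed ggplot2 still has an internal bin_vector() whose second argument is a bins object with breaks and fuzzy elements, which may not hold in every version. The wrapper patched_bin_vector is a hypothetical diagnostic: it delegates to the original function rather than editing its body.

```r
library(ggplot2)

# Print the current definition (internal function, hence the triple colon).
print(ggplot2:::bin_vector)

# Keep a handle on the original so the wrapper can delegate to it.
original_bin_vector <- ggplot2:::bin_vector

# Hypothetical diagnostic wrapper: show both sets of breaks, then run the
# unmodified function with the same arguments.
patched_bin_vector <- function(x, bins, ...) {
  message("breaks: ", paste(format(bins$breaks, digits = 17), collapse = ", "))
  message("fuzzy:  ", paste(format(bins$fuzzy, digits = 17), collapse = ", "))
  original_bin_vector(x, bins, ...)
}

# Replace the function inside the ggplot2 namespace for this session.
utils::assignInNamespace("bin_vector", patched_bin_vector, ns = "ggplot2")
```

The replacement lasts only for the current R session. Restarting R restores the installed definition.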
Just to be clear, the code above is edited for clarity: the function has a lot of type-checking and other calculations which I have removed, but which you would need to keep when patching the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.

OLD VERSIONS
The unexpected behaviour is NOT observed in versions 2.0.9.3 or 2.1.0.1 and appears to originate in the current release 2.2.0.1 (or perhaps the earlier 2.2.0.0, which gave me an error when I tried to call it).

To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:

To load the old version, call it from your local directory:
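A sketch of both steps, under stated assumptions: the archive URL follows the usual CRAN Archive layout and is not confirmed here, the local path is made up, and install_old_version is a hypothetical helper. An old source package may also fail to build on a recent R or need older dependencies.

```r
# Hypothetical helper: install a source tarball into its own library directory
# so the currently installed ggplot2 is left untouched.
install_old_version <- function(tarball_url, lib_dir) {
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)
  install.packages(tarball_url, repos = NULL, type = "source", lib = lib_dir)
  invisible(lib_dir)
}

# Assumed archive location and a made-up local path; adjust both as needed.
old_url <- "https://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
old_lib <- file.path("~", "R", "testing", "ggplot2093")

# In a fresh R session (not run here, since it needs network access):
# install_old_version(old_url, old_lib)
# library("ggplot2", lib.loc = old_lib)
```

Passing lib.loc to library() is what makes R pick up the copy in the separate directory instead of the default installation.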