I am using ggplot2 to make a histogram:
geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")
and I get the error:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x, and I would like the height of each histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data).
Edit: if I want to make a density plot with geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.
The confusion here is a long-standing one (as evidenced by the verbose warning message) that all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" by default (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data).
But let's back up. The geom_* functions control the actual rendering of data into some sort of geometric form. The stat_* functions simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar, and so it can seem indistinguishable from geom_bar when you're learning.
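To make the geom/stat pairing concrete, here is a minimal sketch. The data frame df, the seed, and bins = 10 are made up for illustration. It builds one plot from the stat side and one from the geom side, then compares the binned counts that layer_data() reports for each.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(200))  # hypothetical example data

# A stat layer: bins the data, then draws with its default geom ("bar").
p_stat <- ggplot(df, aes(x = x)) + stat_bin(bins = 10)

# A geom layer: draws bars, after transforming with its default stat ("bin").
p_geom <- ggplot(df, aes(x = x)) + geom_histogram(bins = 10)

# layer_data() returns the layer's data after the stat has been applied.
counts_stat <- layer_data(p_stat)$count
counts_geom <- layer_data(p_geom)$count

# Both routes should produce the same binned counts.
all.equal(counts_stat, counts_geom)
```

Either spelling should give the same picture; which one you write is mostly a matter of whether you think of the layer as "a binning" or as "some bars".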
In any case, consider the "bar"-like geoms: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms; now stat_count for bar charts) on your x values.
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable, you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be OK and should not throw that warning (it doesn't for me using @mnel's example above).
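For the normalized-fraction histogram asked about in the question, a sketch along these lines should work. The data frame df, the seed, and bins = 20 are hypothetical. It uses after_stat(), which is the newer spelling of the ..count.. notation (this assumes ggplot2 3.3.0 or later). Since ncount is just count divided by the largest count, ..ncount../sum(..ncount..) reduces to the same thing as count / sum(count).

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(500))  # hypothetical example data

# Map y to a function of a variable computed by stat_bin.
# after_stat(count) is the newer equivalent of ..count..
p <- ggplot(df, aes(x = x)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 20) +
  ylab("fraction of observations")

# Bar heights as computed for the layer; they should add up to 1.
bar_heights <- layer_data(p)$y
sum(bar_heights)
```

Note that the mapping goes inside aes(); passing y as a quoted string outside aes(), as in the question, does not map it to the computed variables.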
The take-away here is that for geom_bar, if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col, which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
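As a small illustration of that take-away, the sketch below uses the two toy tables from earlier (named raw and pre here, names chosen for this example) and checks that all three routes give bars of the same height. geom_col assumes ggplot2 2.2.0 or later.

```r
library(ggplot2)

# The two toy data sets from above: unbinned and pre-binned.
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))
pre <- data.frame(x = c("a", "b"), y = c(3, 3))

# Unbinned data: let the default stat count the rows, and do not map y.
p_raw <- ggplot(raw, aes(x = x)) + geom_bar()

# Pre-binned data: map y yourself and switch the counting off ...
p_pre <- ggplot(pre, aes(x = x, y = y)) + geom_bar(stat = "identity")

# ... or use geom_col, which uses stat = "identity" by default.
p_col <- ggplot(pre, aes(x = x, y = y)) + geom_col()

# The bar heights computed for each layer.
heights_raw <- layer_data(p_raw)$y
heights_pre <- layer_data(p_pre)$y
heights_col <- layer_data(p_col)$y
```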
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies there as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex), since there hasn't been as much flexibility available in the z variable, which is the 2d analogue of the binned y variable in the 1d case. If future updates start allowing fancier manipulations of the 2d binning cases, this could, I suppose, become something you have to watch out for there.
The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar.
The documentation for geom_density states that it uses a smooth density estimate produced using stat_density.
Following the links (or finding the help pages directly):
stat_bin
The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns:
count: number of points in bin
density: density of points in bin, scaled to integrate to 1
ncount: count, scaled to maximum of 1
ndensity: density, scaled to maximum of 1
stat_density
The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns:
density: density estimate
count: density * number of points (useful for stacked density plots)
scaled: density estimate, scaled to maximum of 1
To produce a plot on the same scale it would appear that you want ..ndensity.. from stat_bin and ..scaled.. from stat_density, or ..density.. from both:
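# dd is a data frame with a numeric column x
# Both layers on the density scale (each integrates to 1)
ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(aes(y = ..density..))

# Both layers rescaled to a maximum of 1
ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = ..ndensity..)) +
  geom_density(aes(y = ..scaled..))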
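To see what these computed variables look like numerically, and in particular what ..scaled.. does, one can pull them out with layer_data(). This is only a sketch: the dd built here is a made-up stand-in for the data frame used above, and bins = 15 is arbitrary.

```r
library(ggplot2)

set.seed(1)
dd <- data.frame(x = rnorm(300))  # hypothetical stand-in for dd

# Data as computed by stat_bin and stat_density respectively.
hist_data <- layer_data(ggplot(dd, aes(x = x)) + geom_histogram(bins = 15))
dens_data <- layer_data(ggplot(dd, aes(x = x)) + geom_density())

# "density" from stat_bin: bar areas (height * width) should add up to 1.
bin_width <- hist_data$xmax - hist_data$xmin
hist_area <- sum(hist_data$density * bin_width)

# "ndensity" and "scaled": the same curves rescaled so the peak is at 1.
hist_peak <- max(hist_data$ndensity)
dens_peak <- max(dens_data$scaled)

c(hist_area = hist_area, hist_peak = hist_peak, dens_peak = dens_peak)
```

So ..scaled.. is simply the density estimate divided by its own maximum, which is why it pairs with ..ndensity.. rather than with a fraction-of-total scale.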