Error with ggplot2 mapping variable to y and using

2020-02-03 04:19发布

问题:

I am using ggplot2 to make a histogram:

geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")

and I get the error:

Mapping a variable to y and also using stat="bin".
  With stat="bin", it will attempt to set the y value to the count of cases in each group.
  This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
  If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
  If you want y to represent values in the data, use stat="identity".
  See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)

What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x and would like the height of the histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data.)

edit: if I want to make a density plot geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.

回答1:

The confusion here is a long standing one (as evidenced by the verbose warning message) that all starts with stat_bin.

But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data) by default.

But let's back up. geom_*'s control the actual rendering of data into some sort of geometric form. stat_*'s simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar and so it can seem indistinguishable from geom_bar when you're learning.

In any case, consider the "bar"-like geom's: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:

x
a
a
a
b
b
b

or equivalently from

x  y
a  3
b  3

The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms, now stat_count for bar charts) on your x values.

As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be ok and should not throw that warning (it doesn't for me using @mnel's example above).

The take-away here is that for geom_bar if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.

geom_dotplot uses it's own binning stat, stat_bindot, and this discussion applies here as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex) since there hasn't been as much flexibility available in the analogous z variable to the binned y variable in the 1d case. If future updates start allowing more fancy manipulations of the 2d binning cases this could I suppose become something you have to watch out for there.



回答2:

The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar

The documentation for geom_density states that uses a smooth density estimate produced using stat_density

Following the links (or finding the help pages directly)

stat_bin

The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns

count number of points in bin

density density of points in bin, scaled to integrate to 1

ncount count, scaled to maximum of 1

ndensity density, scaled to maximum of 1

stat_density

The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns

density density estimate

count density * number of points - useful for stacked density plots

scaled density estimate, scaled to maximum of 1


To produce a plot on the same scale it would appear that you want ..ndensity.. from stat_bin and ..scaled.. from stat_density or ..density.. from both

ggplot(dd, aes(x=x)) + 
  geom_histogram(aes(y= ..density..)) +  
  geom_density(aes(y=..density..))


ggplot(dd, aes(x=x)) + 
  geom_histogram(aes(y= ..ndensity..)) + 
  geom_density(aes(y=..scaled..))


标签: r ggplot2