I have a df with multiple y-series which I want to plot individually, so I wrote a fn that selects one particular series, assigns it to a local variable `dat`, then plots it. However, ggplot/geom_step, when called inside the fn, doesn't treat it properly as a single series. I don't see how this can be a scoping issue, since if `dat` wasn't visible, surely ggplot would fail?
You can verify that the code works when executed from the top-level environment, but not inside the function. This is not a duplicate question. I understand the problem (this is a recurring issue with ggplot), but I've read all the other answers; they do not give the solution.
set.seed(1234)
require(ggplot2)
require(scales)
N = 10
df <- data.frame(x = 1:N,
                 id_ = c(rep(20,N), rep(25,N), rep(33,N)),
                 y = c(runif(N, 1.2e6, 2.9e6), runif(N, 5.8e5, 8.9e5), runif(N, 2.4e5, 3.3e5)),
                 row.names=NULL)

plot_series <- function(id_, envir=environment()) {
  dat <- subset(df,id_==id_)
  p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
  # Unsuccessfully trying the approach from http://stackoverflow.com/questions/22287498/scoping-of-variables-in-aes-inside-a-function-in-ggplot
  p$plot_env <- envir
  plot(p)
  # Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
  return(p)
}

# BAD: doesn't plot geom_step!
plot_series(20)

# GOOD! but what's causing the difference?
ggplot(data=subset(df,id_==20), mapping=aes(x,y), color='red') + geom_step()

#plot_series(25)
#plot_series(33)
This works fine:
If you simply step through the original function using `debug`, you'll quickly see that the `subset` line did not actually subset the data frame at all: it returned all rows!

Why? Because `subset` uses non-standard evaluation and you used the same name for both the column name and the function argument. As jlhoward demonstrates above, it would have worked (but probably not been advisable) to have simply used different names for the two.

The reason is that `subset` evaluates with the data frame first. So all it sees in the logical expression is the always true `id_ == id_` within that data frame.

One way to think about it is to play dumb (like a computer) and ask yourself, when presented with the condition `id_ == id_`, how do you know what exactly each symbol refers to? It's ambiguous, and `subset` makes a consistent choice: use what's in the data frame.

Notwithstanding the comments, this works:
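The snippet that originally followed is not reproduced here. As a minimal sketch of the kind of fix being discussed, assuming the `df` from the question, the filter can be written so that the function argument cannot be confused with the column. The argument name `series_id` and the helper `select_series` are hypothetical names chosen for this illustration, and the colour is set in `geom_step()` rather than in `ggplot()`, which is a choice of this sketch rather than something taken from the question.

```r
library(ggplot2)

# Same shape of data as in the question.
set.seed(1234)
N <- 10
df <- data.frame(x = 1:N,
                 id_ = c(rep(20, N), rep(25, N), rep(33, N)),
                 y = c(runif(N, 1.2e6, 2.9e6),
                       runif(N, 5.8e5, 8.9e5),
                       runif(N, 2.4e5, 3.3e5)))

# Hypothetical helper: `series_id` is deliberately not the name of any column,
# and plain `[` indexing uses standard evaluation, so df$id_ is the column
# and series_id is the function argument.
select_series <- function(series_id) {
  df[df$id_ == series_id, ]
}

plot_series <- function(series_id) {
  dat <- select_series(series_id)
  # Fixed colour applied to the step layer (an assumption of this sketch).
  p <- ggplot(data = dat, mapping = aes(x, y)) + geom_step(color = "red")
  print(p)
  invisible(p)
}

plot_series(20)
```

Renaming the argument alone would also let `subset(df, id_ == series_id)` work, but indexing with `[` avoids relying on how `subset` resolves names.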
The problem seems to be that `subset` is interpreting `id_` on the RHS of the `==` as identical to the LHS, so this is equivalent to subsetting on `T`, which of course includes all the rows of `df`. That's the plot you are seeing.
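The behaviour described above can be checked without ggplot at all. The following is a small sketch using a made-up data frame with the same `id_` column layout; `clash` and `no_clash` are hypothetical helper names introduced only for this illustration.

```r
# Small data frame with an `id_` column, mirroring the question's layout.
df <- data.frame(x = rep(1:3, 3),
                 id_ = rep(c(20, 25, 33), each = 3))

# Reproduces the name clash: inside subset(), both `id_` symbols are looked up
# in the data frame first, so the column is compared with itself.
clash <- function(id_) subset(df, id_ == id_)

# Ordinary indexing: df$id_ is the column, the bare id_ is the argument.
no_clash <- function(id_) df[df$id_ == id_, ]

nrow(clash(20))     # same as nrow(df): nothing was filtered out
nrow(no_clash(20))  # only the rows belonging to series 20
```

Comparing the two row counts makes it easy to see that the function was plotting all three series at once rather than failing to draw the step.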