boxplots with formulas in r [closed]

2020-08-09 04:56发布

问题:

I can't understand how formulas work in R, what formulas are. So, if I have a vector containing a monthly time series, how could I create a boxplot of it, where datas are divided in a seasonal logic? I would like to have 12 boxes, one per month.

回答1:

At the storage level, a formula is a parse tree. The parse tree encodes a call to the `~`() function taking one or two arguments. For a one-sided formula, it takes a single argument which represents the RHS, while for a two-sided formula it takes two arguments which represent the LHS and RHS of the formula.

It should be mentioned that the call to `~`() which is embedded in a formula's parse tree storage representation doesn't really mean anything. In general, the `~`() function doesn't actually do anything, other than allowing for the creation of formula objects, either explicitly (e.g. `~`(a+b,c/d)) or using the syntactic sugar feature provided by the R language (e.g. a+b~c/d). The use of the `~`() function in the encoding of the top level of the formula's parse tree storage representation is a fairly arbitrary and inconsequential implementation detail. I'll expand on this later on.


The R language makes it possible to break down parse trees into recursive list structures, which can help us to inspect and understand the structure of these parse trees.

I've written a short recursive function that can do this:

ptunwrap <- function(x) if (typeof(x)=='language') lapply(as.list(x),ptunwrap) else x;

So let's look at an example:

f1 <- a+b~c/d;
f1;
## a + b ~ c/d
ptunwrap(f1);
## [[1]]
## `~`
##
## [[2]]
## [[2]][[1]]
## `+`
##
## [[2]][[2]]
## a
##
## [[2]][[3]]
## b
##
##
## [[3]]
## [[3]][[1]]
## `/`
##
## [[3]][[2]]
## c
##
## [[3]][[3]]
## d
##

When a parse tree contains a function call, the function call is represented as a single list node within the recursive list structure. The symbol of the function is embedded as the first list component of that list, and its arguments are embedded as the subsequent list components of the list.

There are three function calls in the above parse tree. The top-level list represents the call to the `~`() function, as I described earlier. The second top-level list component further branches into another list, which consists of a call to the `+`() function, which itself takes two arguments, the symbols a and b. The third component is similar, representing a call to the `/`() function and again taking two arguments, symbols c and d.


It is important to understand that although a parse tree always represents syntactically valid R code, and has the ability to be evaluated to produce a single result value, it is not necessary to evaluate a parse tree. It is perfectly possible to create a parse tree and never evaluate it.

What, then, is the purpose of creating a parse tree that you never evaluate? In R this is often done to facilitate communication of certain pieces of information to functions using a convenient syntax.

As a random example, the data.table package allows for the following syntactic sugar to add a column to an existing data.table, using the := operator:

library(data.table);
dt <- data.table(a=1:3);
dt[,b:=a*2L];
dt;
##    a b
## 1: 1 2
## 2: 2 4
## 3: 3 6

Internally, data.table uses non-standard argument evaluation to retrieve the parse tree of the second argument (technically third in the function definition; second in syntactic-sugar form) to the `[.data.table`() function, often referred to as the "j argument" due to the parameter name being j. If you want, you can actually examine the source itself directly in R. Here's a snippet of the most relevant piece of code:

data.table:::`[.data.table`;
## function (x, i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"),
##     mult = "all", roll = FALSE, rollends = if (roll == "nearest") c(TRUE,
##         TRUE) else if (roll >= 0) c(FALSE, TRUE) else c(TRUE,
##         FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"),
##     allow.cartesian = getOption("datatable.allow.cartesian"),
##     drop = NULL, on = NULL)
## {
##
## ... snip ...
##
##     if (!missing(j)) {
##         jsub = substitute(j)
##         if (is.call(jsub))
##             jsub = construct(deconstruct_and_eval(replace_dot(jsub),
##                 parent.frame(), parent.frame()))
##     }
##
## ... snip ...
##

We can see they're using substitute(j) to retrieve the parse tree of the j argument. For the demo above, this is what they would get:

ptunwrap(substitute(b:=a*2L));
## [[1]]
## `:=`
##
## [[2]]
## b
##
## [[3]]
## [[3]][[1]]
## `*`
##
## [[3]][[2]]
## a
##
## [[3]][[3]]
## [1] 2
##

Later on in the code, they test if the top-level function symbol is :=, which is the operator supported for adding (or modifying, or removing with a RHS of NULL) columns to the data.table. If so, they test if the LHS consists of a single bareword, which is taken as the name of the column to add (or modify or remove). Observe that it is, in fact, not possible for them to actually evaluate the LHS of the parse tree in this case, because it consists of a symbol that does not yet exist in the data.table. However, the RHS does end up being evaluated to produce the column vector to be added to the data.table under the new name.

So it should be clear that formulas can be used in a variety of contexts in R, and they are not always evaluated. Sometimes the parse tree is simply inspected to retrieve information passed from the caller to the callee. Even in contexts where they are evaluated, sometimes only the LHS or RHS (or both) will be evaluated independently, ignoring the top-level function symbol that was embedded in the parse tree at the time it was created.


Turning to the boxplot() function, let's look at the documentation on the formula argument:

a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor).

In this case, both sides of the formula end up being evaluated independently, the LHS providing the data vector, and the RHS providing the grouping definition.

A good way to demonstrate this is as follows:

boxplot(1:9~1:9%%3L);

Notice how both sides of the formula consist of literal expressions:

1:9;
## [1] 1 2 3 4 5 6 7 8 9
1:9%%3L;
## [1] 1 2 0 1 2 0 1 2 0

Internally, boxplot() had to independently evaluate each side of the formula to produce the data and grouping vectors, almost as if you had passed the two expressions as separate arguments.


So, let's create a simple demonstration of a monthly time series boxplot:

N <- 36L; df <- data.frame(date=seq(as.Date('2016-01-01'),by='month',len=N),y=rnorm(N));
df;
##          date           y
## 1  2016-01-01 -1.56004488
## 2  2016-02-01  0.65699747
## 3  2016-03-01  0.05729631
## 4  2016-04-01 -0.02092276
## 5  2016-05-01  0.46673530
## 6  2016-06-01 -0.18652580
## 7  2016-07-01  0.06228650
## 8  2016-08-01  1.54452267
## 9  2016-09-01  1.06643594
## 10 2016-10-01 -1.51178160
## 11 2016-11-01  0.82904673
## 12 2016-12-01  0.37667201
## 13 2017-01-01 -0.10135801
## 14 2017-02-01  0.94692462
## 15 2017-03-01 -1.60781946
## 16 2017-04-01  0.47189753
## 17 2017-05-01 -1.32869317
## 18 2017-06-01 -0.49821455
## 19 2017-07-01  0.54474606
## 20 2017-08-01  0.47565264
## 21 2017-09-01 -0.97494730
## 22 2017-10-01 -1.22781588
## 23 2017-11-01 -0.34919086
## 24 2017-12-01 -0.78153843
## 25 2018-01-01 -0.59355220
## 26 2018-02-01 -2.58287605
## 27 2018-03-01  1.42148186
## 28 2018-04-01 -1.01278176
## 29 2018-05-01 -0.80961662
## 30 2018-06-01  0.19793126
## 31 2018-07-01 -1.03072915
## 32 2018-08-01 -0.87896416
## 33 2018-09-01 -2.36216655
## 34 2018-10-01  1.82708221
## 35 2018-11-01  0.05579195
## 36 2018-12-01  1.28612246
boxplot(y~months(date),df);


If you want, you can study the source code, which requires tracing the S3 lookup process:

boxplot;
## function (x, ...)
## UseMethod("boxplot")
## <bytecode: 0x600b50760>
## <environment: namespace:graphics>
graphics:::boxplot.formula;
## function (formula, data = NULL, ..., subset, na.action = NULL)
## {
##     if (missing(formula) || (length(formula) != 3L))
##         stop("'formula' missing or incorrect")
##     m <- match.call(expand.dots = FALSE)
##     if (is.matrix(eval(m$data, parent.frame())))
##         m$data <- as.data.frame(data)
##     m$... <- NULL
##     m$na.action <- na.action
##     m[[1L]] <- quote(stats::model.frame)
##     mf <- eval(m, parent.frame())
##     response <- attr(attr(mf, "terms"), "response")
##     boxplot(split(mf[[response]], mf[-response]), ...)
## }
## <bytecode: 0x6035c67f8>
## <environment: namespace:graphics>

It is almost exasperatingly roundabout and complex, but graphics:::boxplot.formula() effectively retrieves (via match.call()) the parse tree that caused its own invocation, massages it a little, most notably replacing its own function symbol boxplot.formula with stats::model.frame, and then evaluates that new parse tree, thereby calling stats::model.frame(). That function is itself very complex, and involves further S3 lookup, but here's the most relevant code:

model.frame;
## function (formula, ...)
## UseMethod("model.frame")
## <bytecode: 0x601464b18>
## <environment: namespace:stats>
model.frame.default;
## function (formula, data = NULL, subset = NULL, na.action = na.fail,
##     drop.unused.levels = FALSE, xlev = NULL, ...)
## {
##
## ... snip ...
##
##     if (!inherits(formula, "terms"))
##         formula <- terms(formula, data = data)
##     env <- environment(formula)
##     rownames <- .row_names_info(data, 0L)
##     vars <- attr(formula, "variables")
##     predvars <- attr(formula, "predvars")
##     if (is.null(predvars))
##         predvars <- vars
##     varnames <- sapply(vars, function(x) paste(deparse(x, width.cutoff = 500),
##         collapse = " "))[-1L]
##     variables <- eval(predvars, data, env)
##
## ... snip ...
##

So, eventually, it retrieves the individual expressions from the formula object, and evaluates them using eval() with the given data.frame and closure environment of the formula as context, which gives the result vectors:

attr(terms(y~months(date),data=df),'variables');
## list(y, months(date))
eval(attr(terms(y~months(date),data=df),'variables'),df);
## [[1]]
##  [1] -1.56004488  0.65699747  0.05729631 -0.02092276  0.46673530 -0.18652580  0.06228650  1.54452267  1.06643594 -1.51178160  0.82904673  0.37667201 -0.10135801  0.94692462 -1.60781946  0.47189753 -1.32869317 -0.49821455  0.54474606
## [20]  0.47565264 -0.97494730 -1.22781588 -0.34919086 -0.78153843 -0.59355220 -2.58287605  1.42148186 -1.01278176 -0.80961662  0.19793126 -1.03072915 -0.87896416 -2.36216655  1.82708221  0.05579195  1.28612246
##
## [[2]]
##  [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"   "November"  "December"  "January"   "February"  "March"     "April"     "May"       "June"      "July"
## [20] "August"    "September" "October"   "November"  "December"  "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"   "November"  "December"
##

To reiterate a point made earlier, notice how the `~`() function is nowhere to be found in the evaluation process. It is an arbitrary implementation detail of R formulas that they encode the `~`() function as the top-level function symbol in the parse tree storage representation of formula objects. The actual evaluation of the side(s) of the formula, if it occurs, does not involve evaluation of that function.

Finally, let's consider what happens if you do actually evaluate the entire parse tree that comprises the storage representation of a formula. Recall that the `~`() function does nothing more than create a formula from its arguments. Therefore, evaluating a formula has the interesting effect of spitting out the very same formula that was just evaluated:

f1;
## a + b ~ c/d
eval(f1);
## a + b ~ c/d
eval(eval(f1));
## a + b ~ c/d

I've written a couple of other answers on parse trees and formulas which may be of interest to you:

  • What is "{" class in R? -- focuses on the data types involved in parse trees.
  • What does it mean by formulas and closures being able to "capture the enclosing environment" in R? -- focuses on the closure environments of formulas and functions.