Difference between tilde and “by” while using aggr

Every time I aggregate a data.frame, I default to the "by = list(...)" parameter. But I see solutions on Stack Overflow and elsewhere that use a tilde (~) in the "formula" parameter instead. I tend to think of the "by" parameter as the variables to "pivot" around.

In some cases, the output is exactly the same. For example:

aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))

AND

aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)

What is the difference between the two and when do you use which?

Tags: r dataframe aggregate

2 answers

男人必须洒脱

#2 · 2020-03-05 02:52

From the help page (?aggregate):

"aggregate.formula is a standard formula interface to aggregate.data.frame."

So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.


闹够了就滚

#3 · 2020-03-05 03:00

I would not entirely disagree that it doesn't really matter which approach you use. However, it is important to note that the two interfaces do behave differently.

I'll illustrate with a small example.

Here's some sample data:

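# The output shown below comes from an older R. Since R 3.6.0 the default
# sample() algorithm differs, so the values may not match unless
# sample.kind = "Rounding" is used.
set.seed(1)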
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                   B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
                   matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
  x[sample(nrow(mydf), 1)] <- NA
  x
})
mydf
#    A B X1  X2 X3
# 1  1 A 27  69 27
# 2  1 A 38  NA 39
# 3  1 A 58  77  2
# 4  2 A 91  50 39
# 5  2 A 21  72 87
# 6  3 B 90 100 35
# 7  3 B 95  39 49
# 8  3 B 67  78 60
# 9  3 B 63  94 NA
# 10 4 B NA  22 19
# 11 4 B 21  66 83
# 12 4 B 18  13 67

First, the formula interface. The following three commands will all yield the same output.

aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
#   A B  X1  X2  X3
# 1 1 A  85 146  29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B  39  79 150

Here's the equivalent command for the "by" interface. It is pretty cumbersome to type (but that can be addressed by wrapping the call in with(), if required).

aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
          by = list(mydf$A, mydf$B), FUN = sum)
#   Group.1 Group.2  V1  V2  V3
# 1       1       A 123  NA  68
# 2       2       A 112 122 126
# 3       3       B 315 311  NA
# 4       4       B  NA 101 169
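
The with() shortcut mentioned above could look roughly like this. It is only a sketch: the data frame is rebuilt by hand from the printed table (so it does not depend on the random number generator), and the result names res_with and res_sub are made up for illustration.

```r
# Sample data copied from the printed table above (an assumption: the table is mydf).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = rep(c("A", "B"), times = c(5, 7)),
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# with() evaluates the call inside mydf, so the mydf$ prefixes can be dropped.
# cbind() on bare column names keeps them as the output column names.
res_with <- with(mydf, aggregate(cbind(X1, X2, X3),
                                 by = list(A = A, B = B), FUN = sum))

# Alternative: pass data frame subsets; a data frame is a list, so it is a valid "by".
res_sub <- aggregate(mydf[c("X1", "X2", "X3")], by = mydf[c("A", "B")], FUN = sum)

names(res_with)
# "A"  "B"  "X1" "X2" "X3"
```

Both calls still go through the data.frame method, so groups containing an NA come back as NA, just as in the output above.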

Now, stop and make note of any differences.

The two that pop into my mind are:

1. The formula method does a nicer job of preserving names, but it doesn't let you control the names directly in your command, which you can do with the data.frame method:
```
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
          by = list(NewA = mydf$A, NewB = mydf$B), FUN = sum)
```
2. The formula method and the data.frame method treat NA values differently. By default the formula method uses na.action = na.omit, which drops every row containing an NA before aggregating. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.
```
aggregate(. ~ A + B, mydf, sum, na.action = na.pass)
```
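
To make the NA difference concrete, here is a small side-by-side sketch. The data frame is rebuilt by hand from the printed table, and the names f_omit, f_pass and b_narm are only illustrative. The third call is an extra comparison not in the text above: it forwards na.rm = TRUE to sum(), which skips individual NA cells rather than whole rows.

```r
# Sample data copied from the printed table above (an assumption: the table is mydf).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = rep(c("A", "B"), times = c(5, 7)),
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# Formula interface, default na.action = na.omit:
# rows 2, 9 and 10 are dropped entirely before summing.
f_omit <- aggregate(cbind(X1, X2, X3) ~ A + B, data = mydf, FUN = sum)
f_omit$X1
# 85 112 252 39

# Formula interface with na.pass: the NAs reach sum(), so affected groups are NA.
f_pass <- aggregate(cbind(X1, X2, X3) ~ A + B, data = mydf, FUN = sum,
                    na.action = na.pass)
f_pass$X1
# 123 112 315 NA

# "by" interface with na.rm = TRUE passed on to sum(): only the NA cells are skipped.
b_narm <- aggregate(mydf[c("X1", "X2", "X3")], by = mydf[c("A", "B")],
                    FUN = sum, na.rm = TRUE)
b_narm$X1
# 123 112 315 39
```

Note that the first and third results differ: dropping a whole row also discards the non-missing values in that row, so the choice affects every aggregated column.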

Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
