When to use the do() function in dplyr

Posted 2020-02-17 09:21

Question:

I've learned that the do function is used when you want to apply a function to each group.

For example, if I want to pull the top 2 rows from the "A", "C", and "I" categories of the variable Index, the following syntax can be used.

t <- mydata %>% filter(Index %in% c("A", "C", "I")) %>% group_by(Index) %>% do(head(.,2))

I understand that after grouping by Index, the do function is used to compute head(., 2) for each group.

However, on some occasions do is not used at all. For example, to compute the mean of the variable Y2014 grouped by the variable Index, I thought the following code should be used.

t <- mydata %>% group_by(Index) %>% do(summarise(Mean_2014 = mean(Y2014)))

However, the above syntax returns an error:

Error in mean(Y2014) : object 'Y2014' not found

But if I remove do from the syntax, it returns exactly what I wanted.

t <- mydata %>% group_by(Index) %>% summarise(Mean_2014 = mean(Y2014))

I'm really confused about the usage of the do function in dplyr. It seems inconsistent to me. When should I use do, and when should I not? Why should I use do in the first case and not in the second?

Answer 1:

The comments under the question point out that in many cases dplyr or an associated package offers an alternative that avoids do, and the examples in the question are of that sort. However, to answer the question directly rather than via alternatives:

Differences between using do and not using it

Within the context of data frames, the key differences between using do and not using do are:

  1. No automatic insertion of dot. The code within do will not have dot automatically inserted as its first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with an explicit dot. This is because %>% inserts dot only into the function call on its right-hand side, which here is do, not the summarise call nested inside it. This is important to understand so that we insert dot when needed. If the purpose were simply to avoid automatic insertion of dot into the first argument, we could alternatively use braces to get that effect: whatever %>% { myfun(arg1, arg2) } would also not insert dot as the first argument of the myfun call.

  2. Respecting group_by. Only functions specifically written to respect group_by will do so. There are two issues here. (1) Only such functions are run once for each group; mutate, summarize and do are examples (there are others too). (2) Even if the function is run once for each group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) if do is not used and dot appears inside a function call nested within an argument expression, it refers to the entire input, ignoring group_by. This follows from magrittr's dot substitution rules: the pipe binds dot to its whole left-hand side and knows nothing about group_by. (ii) Within do, on the other hand, dot always refers to the rows of the current group. For example, compare the output of the two pipelines below and note that dot refers to 3 rows in the first, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.

    BOD$g <- c(1, 1, 1, 2, 2, 2)
    BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
    ## # A tibble: 2 x 2
    ## # Groups: g [2]
    ##       g    nr
    ##   <dbl> <int>
    ## 1  1.00     3
    ## 2  2.00     3
    
    BOD %>% group_by(g) %>% summarize(nr = nrow(.))
    ## # A tibble: 2 x 2
    ##       g    nr
    ##   <dbl> <int>
    ## 1  1.00     6
    ## 2  2.00     6
    

See ?do for more information.
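
The dot-insertion rule in point 1 can be seen with the pipe alone, without any data frame. The following is a minimal sketch, assuming only that dplyr (which re-exports %>% from magrittr) is installed; the numbers are arbitrary and chosen purely for illustration.

```r
library(dplyr)  # re-exports %>% from magrittr

# Plain call on the right-hand side: the left-hand side is inserted
# as the first argument, so this evaluates sum(5, 1, 2).
auto_inserted <- 5 %>% sum(1, 2)

# Braces on the right-hand side: nothing is inserted, so this
# evaluates sum(1, 2) and the left-hand side is ignored.
not_inserted <- 5 %>% { sum(1, 2) }

# Inside braces the dot is still available if written explicitly.
explicit_dot <- 5 %>% { sum(., 1, 2) }

print(c(auto_inserted, not_inserted, explicit_dot))
```

The first and third results include the 5 and the second does not. This mirrors what happens with do: the pipe hands the data to do itself, so any call nested inside do needs an explicit dot.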

Code from Question

Now we go through the code in the question. Since mydata was never defined in the question, the first line of code below defines it so that the examples are concrete.

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

mydata %>% 
       filter(Index %in% c("A", "C", "I")) %>% 
       group_by(Index) %>% 
       do(head(., 2))

## # A tibble: 6 x 2
## # Groups: Index [3]
##   Index  Y2014
##   <fctr> <dbl>
## 1 A       1.00
## 2 A       1.00
## 3 C       1.00
## 4 C       1.00
## 5 I       1.00
## 6 I       1.00

The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted do, head would disregard group_by and return only two rows, because dot would then be the entire 9 rows of input rather than one group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
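
To make the comparison concrete, the sketch below runs the same grouped pipeline without do and then with slice(). Treating slice() as the dplyr alternative meant in the parenthetical is an assumption, since the answer does not name the function. mydata is redefined as above so that the block runs on its own.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# Without do: head() is not group-aware, so it takes the first
# 2 rows of the whole 9-row input.
no_do <- mydata %>% group_by(Index) %>% head(2)

# slice() is group-aware, so it takes rows 1 and 2 of each group
# without needing do.
with_slice <- mydata %>% group_by(Index) %>% slice(1:2)

nrow(no_do)       # 2
nrow(with_slice)  # 6
```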

The following code from the question generates an error because dot insertion is not done within do and so what ought to be the first argument of summarize, i.e. the data frame input, is missing:

mydata %>% 
       group_by(Index) %>% 
       do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found

If we remove do from the above code, as in the last line of code in the question, it works because dot insertion is then performed. Alternatively, adding the dot, as in do(summarise(., Mean_2014 = mean(Y2014))), would also work, although do is superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.

mydata %>% 
       group_by(Index) %>% 
       summarise(Mean_2014 = mean(Y2014))

## # A tibble: 3 x 2
##   Index  Mean_2014
##   <fctr>     <dbl>
## 1 A           1.00
## 2 C           1.00
## 3 I           1.00
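
For completeness, the variant with the explicit dot can be checked the same way. This is a sketch that redefines mydata so that it is self-contained; it should give the same three group means as the summarise call without do.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# Explicit dot: inside do, `.` is the data frame for the current group,
# so summarise() receives its data argument and Y2014 can be found.
with_do <- mydata %>%
       group_by(Index) %>%
       do(summarise(., Mean_2014 = mean(Y2014)))

# Same computation without do, relying on summarise() being group-aware.
without_do <- mydata %>%
       group_by(Index) %>%
       summarise(Mean_2014 = mean(Y2014))

# Compare the per-group means from the two approaches.
all.equal(with_do$Mean_2014, without_do$Mean_2014)
```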


Tags: r dplyr