Group by multiple columns in dplyr, using string vector input

Posted 2019-01-01 14:15

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.

library(plyr)
library(dplyr)

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

What am I missing to translate the plyr example into a dplyr-esque syntax?

Edit 2017: dplyr has been updated, so a simpler solution is available. See the currently selected answer.
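For reference, a minimal sketch of how this is commonly written in current dplyr (version 1.0 or later), using across() together with all_of() to group by a character vector of column names:

library(dplyr)

# sketch assuming dplyr >= 1.0: all_of() turns the character vector of
# column names into a selection, and across() groups by those columns
data %>%
  group_by(across(all_of(columns))) %>%
  summarise(Value = mean(value), .groups = "drop")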

Tags: r dplyr
9 answers
伤终究还是伤i
Answer #2 · 2019-01-01 14:45

Since this question was posted, dplyr has added scoped versions of group_by() (see the documentation for the scoped variants). These let you use the same helper functions you would use with select(), like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
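For example, the same pipeline with the leftover grouping dropped explicitly:

df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value)) %>%
  ungroup()  # remove the grouping left over after summarize()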

永恒的永恒
Answer #3 · 2019-01-01 14:45
library(dplyr)

data = data.frame(
  my.a = sample(LETTERS[1:3], 100, replace=TRUE),
  my.b = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# group by a single key pasted together from both columns
data %>%
  group_by(newcol = paste(my.a, my.b, sep = "_")) %>%
  summarise(Value = mean(value))
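Note that this groups by a single pasted key rather than by the two original columns. If you need them back as separate columns afterwards, one option (a sketch, assuming tidyr is available) is to split the key again:

library(dplyr)
library(tidyr)

data %>%
  group_by(newcol = paste(my.a, my.b, sep = "_")) %>%
  summarise(Value = mean(value)) %>%
  separate(newcol, into = c("my.a", "my.b"), sep = "_")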
爱死公子算了
Answer #4 · 2019-01-01 14:49

To write the code out in full, here's an update of Hadley's answer with the newer syntax:

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

output:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10
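In current dplyr, group_by_() and the .dots argument are deprecated; a roughly equivalent sketch (assuming dplyr >= 0.7, so that rlang splicing is available) converts the strings to symbols and splices them into group_by():

library(dplyr)

# turn the character vector into symbols and splice them in with !!!
df %>%
  group_by(!!!rlang::syms(grp_cols)) %>%
  summarise(n = n())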
谁念西风独自凉
Answer #5 · 2019-01-01 14:49

Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that take string arguments. You can mix them with the normal dplyr functions. For example:

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)
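The %.% operator above is dplyr's old chaining operator. If you would rather avoid the wrapper functions entirely, here is a sketch of the same computation in current dplyr (assuming version 1.0 or later):

library(dplyr)

cols <- c("cyl", "gear")
mtcars %>%
  group_by(across(all_of(cols))) %>%
  summarise(avdisp = mean(disp), maxdisp = max(disp), .groups = "drop") %>%
  arrange(avdisp)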
明月照影归
Answer #6 · 2019-01-01 14:50

One (tiny) case missing from the answers here, which I want to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:

library(dplyr)
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>% 
  # 1. create quantized versions of base variables
  mutate_each(
    funs(Quantized = . > 0)
  ) %>% 
  # 2. group_by the indicator variables
  group_by_(
    .dots = grep("Quantized", names(.), value = TRUE)
    ) %>% 
  # 3. summarize the base variables
  summarize_each(
    funs(sum(., na.rm = TRUE)), contains("X_")
  )

This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
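mutate_each(), summarize_each(), and group_by_() are all superseded in current dplyr; here is a sketch of the same pipeline using across() (assuming dplyr >= 1.0 and the X_* column names produced by r_series above):

library(dplyr)
library(wakefield)

df_foo <- r_series(rnorm, 10, 1000)
df_foo %>%
  # 1. create quantized versions of the base variables
  mutate(across(everything(), ~ . > 0, .names = "{.col}_Quantized")) %>%
  # 2. group by the indicator variables
  group_by(across(ends_with("_Quantized"))) %>%
  # 3. summarize the base variables (across() skips the grouping columns)
  summarize(across(contains("X_"), ~ sum(., na.rm = TRUE)), .groups = "drop")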

回忆,回不去的记忆
Answer #7 · 2019-01-01 14:53

A general example of using the .dots argument to pass character vector input to the dplyr::group_by function:

iris %>% 
    group_by(.dots = "Species") %>% 
    summarise(meanpetallength = mean(Petal.Length))

Or, without a hard-coded name for the grouping variable (as asked by the OP):

iris %>% 
    group_by(.dots = names(iris)[5]) %>% 
    summarise_at("Petal.Length", mean)

With the example of the OP:

data %>% 
    group_by(.dots =names(data)[-3]) %>% 
    summarise_at("value", mean)

See also the dplyr programming vignette, which explains pronouns, quasiquotation, quosures, and tidy evaluation.
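As a small illustration of those ideas, here is a hypothetical wrapper (the function name and arguments are illustrative, not from the vignette or the answers above) that takes the grouping columns as a character vector:

library(dplyr)

# sketch assuming dplyr >= 1.0: all_of() resolves the character vector of
# grouping columns, and the .data pronoun looks up the value column by name
mean_by <- function(df, group_cols, value_col) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(mean_value = mean(.data[[value_col]]), .groups = "drop")
}

mean_by(data, names(data)[-3], "value")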
