Tidyverse: filtering n largest groups in grouped d

Posted 2020-08-18 05:26

I want to keep only the n largest groups, ranked by row count, and then do some calculations on the filtered data frame.

Here is some data:

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A     |        1|     10|
|B     |        2|     11|
|C     |        1|     12|
|A     |        1|     13|
|A     |        2|     14|
|B     |        1|     15|
|A     |        2|     14|
|A     |        1|     13|
|B     |        2|     12|
|C     |        1|     11|

This is my expected output. I want to keep only the two largest brands by count and then find the mean clicks for each Brand / Category combination:

|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A     |        1|        12.0|
|A     |        2|        14.0|
|B     |        1|        15.0|
|B     |        2|        11.5|

I thought this could be achieved with code like the following, but it does not work:

df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))

EDIT: the ideal answer should be able to be used on database tables as well as local tables

Tags: r dplyr top-n
6 answers
冷血范
Reply #2 · 2020-08-18 05:58

Slightly different from the above, because I don't like to use joins with large datasets. Some people might not like that I create and then remove a small data frame, sorry :(

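library(dplyr)
# Count rows per Brand and keep the two brands with the most rows
Top2 <- df %>% count(Brand) %>% top_n(2, n)
df %>%
  group_by(Brand, Category) %>%
  filter(Brand %in% Top2$Brand) %>%  # keep rows from the top brands only
  summarise(mean_clicks = mean(Clicks))
remove(Top2)  # drop the temporary data frame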
再贱就再见
Reply #3 · 2020-08-18 06:02

How about this approach, using table() from base R:

df %>%
  filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <chr>    <dbl>       <dbl>
1 A         1.00        12.0
2 A         2.00        14.0
3 B         1.00        15.0
4 B         2.00        11.5
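
To see why the filter condition works, the nested base R call can be unpacked one step at a time. This is only an illustrative sketch on the question's sample data. The intermediate names (`brand_counts`, `sorted_counts`, `top2`, `top_rows`) are made up here for readability, and the last step uses `aggregate()` as a base R stand-in for the dplyr summary.

```r
# Sample data from the question
Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand, Category, Clicks)

# Step 1: count rows per brand
brand_counts <- table(df$Brand)

# Step 2: sort ascending, so the largest counts end up last
sorted_counts <- sort(brand_counts)

# Step 3: take the last two entries and keep only their names
top2 <- names(tail(sorted_counts, 2))

# Step 4: keep rows of those brands, then average Clicks per Brand/Category
top_rows <- df[df$Brand %in% top2, ]
result <- aggregate(Clicks ~ Brand + Category, data = top_rows, FUN = mean)
```

Because `sort()` is ascending by default, `tail()` is what picks the largest groups; using `head()` instead would return the smallest ones.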
chillily
Reply #4 · 2020-08-18 06:09

One idea is to get the counts grouped by Brand and keep the top two (after sorting in descending order). Then we join the result back to the original data frame and find the mean grouped by (Brand, Category).

library(data.table)

#Convert to data.table
dt1 <- setDT(df)

dt1[dt1[, .(cnt = .N), by = Brand][
             order(cnt, decreasing = TRUE), .SD[1:2]][,cnt := NULL], 
                   on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]

which gives,

   Brand Category means
1:     A        1  12.0
2:     A        2  14.0
3:     B        2  11.5
4:     B        1  15.0
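
The chained call is compact but dense, so here is a rough step-by-step sketch of the same idea. The intermediate names (`counts`, `top2`, `top_rows`) are introduced only for this illustration, and the table is built with `data.table()` directly instead of `setDT(df)`, on the assumption that you may not want `df` converted by reference.

```r
library(data.table)

# Sample data from the question, created as a data.table
dt <- data.table(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

# Step 1: count rows per brand
counts <- dt[, .(cnt = .N), by = Brand]

# Step 2: order by count (descending) and keep the first two brands
top2 <- counts[order(-cnt)][1:2, .(Brand)]

# Step 3: join back to the full table; only rows of the top brands remain
top_rows <- dt[top2, on = "Brand"]

# Step 4: mean clicks per Brand/Category combination
result <- top_rows[, .(means = mean(Clicks)), by = .(Brand, Category)]
```

Note that `[1:2]` takes exactly two rows, so if two brands tie for second place only one of them is kept, in whatever order the sort leaves them.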
Rolldiameter
Reply #5 · 2020-08-18 06:11

A different dplyr solution:

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(rank = dense_rank(desc(n))) %>%
  filter(rank == 1 | rank == 2) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <fct>    <dbl>       <dbl>
1 A           1.        12.0
2 A           2.        14.0
3 B           1.        15.0
4 B           2.        11.5

Or a simplified version (based on suggestions from @camille):

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))
聊天终结者
Reply #6 · 2020-08-18 06:17

Another dplyr solution using a join to filter the data frame:

library(dplyr)

df %>%
  group_by(Brand) %>%
  summarise(n = n()) %>%
  top_n(2, n) %>% # select the top 2 brands by count
  left_join(df, by = "Brand") %>% # join back to df, keeping only rows of the top 2 brands
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# # A tibble: 4 x 3
# # Groups:   Brand [?]
#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
# 1 A            1        12  
# 2 A            2        14  
# 3 B            1        15  
# 4 B            2        11.5
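
Since the join here only serves to filter rows, a filtering join is a possible alternative. The sketch below uses `semi_join()`, which keeps the rows of `df` that have a match in the top-brands table without adding any of its columns. It assumes dplyr 1.0.0 or later (for `slice_max()` and the `.groups` argument), and `top_brands` is a name introduced only for this example. Because it uses only dplyr verbs, it may also translate to database tables through dbplyr, but that has not been tested here.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

# Two most frequent brands; with_ties = FALSE returns exactly two rows
top_brands <- df %>%
  count(Brand) %>%
  slice_max(n, n = 2, with_ties = FALSE)

# Keep only rows whose Brand appears in top_brands, then summarise
result <- df %>%
  semi_join(top_brands, by = "Brand") %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")
```

Unlike `left_join()`, the semi join cannot duplicate rows or carry the helper column `n` into the result.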
太酷不给撩
Reply #7 · 2020-08-18 06:21

EDIT

Based on the updated question, we can add a count column first (sorted in descending order), keep only the rows whose count is one of the top n distinct counts, then group_by Brand and Category to find the mean for each group.

df %>%
  add_count(Brand, sort = TRUE) %>%
  filter(n %in% head(unique(n), 2)) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))


#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
#1 A            1        12  
#2 A            2        14  
#3 B            1        15  
#4 B            2        11.5
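
One property of filtering on the count values is worth illustrating: brands that share a count are kept together, so more than two brands can pass the filter. The sketch below uses a small hypothetical data frame (`tied`, not part of the question) in which two brands tie for second place.

```r
library(dplyr)

# Hypothetical data: A appears 3 times, B and C twice each, D once
tied <- data.frame(Brand = c("A","A","A","B","B","C","C","D"))

kept <- tied %>%
  add_count(Brand, sort = TRUE) %>%      # n = rows per Brand, largest first
  filter(n %in% head(unique(n), 2)) %>%  # top two distinct count values: 3 and 2
  distinct(Brand)

# kept holds A, B and C: both brands tied on the second-largest count survive
```

Whether that is desirable depends on the use case; if exactly two brands are required, the ties have to be broken explicitly.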

Original Answer

We can group_by Brand, do all the calculations by group, and then keep the top groups with top_n:

library(dplyr)
df %>%
  group_by(Brand) %>%
  summarise(n = n(), 
            mean = mean(Clicks)) %>%
  top_n(2, n) %>%
  select(-n)

#  Brand  mean
#  <fct> <dbl>
#1  A      12.8
#2  B      12.7