I want to keep only the n largest groups, ranked by row count, and then do some calculations on the filtered data frame.

Here is some data:

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A     |        1|     10|
|B     |        2|     11|
|C     |        1|     12|
|A     |        1|     13|
|A     |        2|     14|
|B     |        1|     15|
|A     |        2|     14|
|A     |        1|     13|
|B     |        2|     12|
|C     |        1|     11|

This is my expected output. I want to keep only the two largest brands by count and then find the mean clicks for each Brand/Category combination:

|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A     |        1|        12.0|
|A     |        2|        14.0|
|B     |        1|        15.0|
|B     |        2|        11.5|

I thought this could be achieved with code like the following, but it does not work:

df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))

EDIT: ideally, the answer should work on database tables as well as local tables.

Tags: r dplyr top-n

6 Answers

冷血范

#2 · 2020-08-18 05:58

Slightly different from the above, because I don't like using joins with large datasets. Some people might not like that I create and then remove a small data frame, sorry :(

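library(dplyr)
# Count rows per Brand and keep the two most frequent brands
Top2 <- df %>% count(Brand) %>% top_n(2, n)
# Keep only rows for those brands, then average Clicks per Brand/Category
df %>%
  filter(Brand %in% Top2$Brand) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))
remove(Top2)  # drop the temporary data frame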
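
If the temporary data frame is the main objection, the same idea can be wrapped in a function so that nothing is left behind in the global environment. This is only a sketch: it assumes dplyr >= 1.0.0 (for slice_max and the .groups argument), and the names mean_clicks_top_brands, n_brands and brand_n are made up for illustration. Note that with_ties = FALSE returns exactly n_brands brands, whereas top_n keeps every brand tied for the last place.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# Hypothetical helper: mean Clicks per Brand/Category,
# restricted to the n_brands brands with the most rows
mean_clicks_top_brands <- function(data, n_brands = 2) {
  # The brand names live only inside the function
  top_brands <- data %>%
    count(Brand, name = "brand_n") %>%
    slice_max(brand_n, n = n_brands, with_ties = FALSE) %>%
    pull(Brand)

  data %>%
    filter(Brand %in% top_brands) %>%
    group_by(Brand, Category) %>%
    summarise(mean_clicks = mean(Clicks), .groups = "drop")
}

result <- mean_clicks_top_brands(df)
result
```

Calling mean_clicks_top_brands(df, n_brands = 1) would restrict the result to brand A only.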


再贱就再见

#3 · 2020-08-18 06:02

How about this approach, using table from base R:

df %>%
  filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <chr>    <dbl>       <dbl>
1 A         1.00        12.0
2 A         2.00        14.0
3 B         1.00        15.0
4 B         2.00        11.5


chillily

#4 · 2020-08-18 06:09

A data.table approach is to count the rows per Brand, keep the top two after sorting the counts in descending order, then join that result back to the original data and compute the mean grouped by (Brand, Category).

library(data.table)

# Convert to data.table (setDT converts df by reference, without copying)
dt1 <- setDT(df)

dt1[dt1[, .(cnt = .N), by = Brand][
             order(cnt, decreasing = TRUE), .SD[1:2]][,cnt := NULL], 
                   on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]

which gives,

   Brand Category means
1:     A        1  12.0
2:     A        2  14.0
3:     B        2  11.5
4:     B        1  15.0
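
For comparison, the join can be swapped for a plain %in% filter on the brand names. This is a sketch of a variation rather than the code above: it works on a copy made with as.data.table so the original df is left untouched, and the names dt, top_brands and res are illustrative.

```r
library(data.table)

# Sample data from the question
df <- data.frame(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# as.data.table() copies, so df itself stays a plain data.frame
dt <- as.data.table(df)

# Count rows per Brand, sort descending, take the first two brand names
top_brands <- head(dt[, .N, by = Brand][order(-N)], 2)$Brand

# Filter on those brands and aggregate in one call
res <- dt[Brand %in% top_brands,
          .(means = mean(Clicks)),
          by = .(Brand, Category)]
res
```

Using head(..., 2) instead of .SD[1:2] avoids NA rows when the table has fewer than two brands.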


Rolldiameter

#5 · 2020-08-18 06:11

A different dplyr solution:

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(rank = dense_rank(desc(n))) %>%
  filter(rank == 1 | rank == 2) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <fct>    <dbl>       <dbl>
1 A           1.        12.0
2 A           2.        14.0
3 B           1.        15.0
4 B           2.        11.5

Or a simplified version (based on suggestions from @camille):

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))
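
One thing to be aware of with dense_rank: tied counts share a rank, so the filter can keep more than two brands. A small illustration with made-up data (df_ties is not part of the question), where B and C both have two rows:

```r
library(dplyr)

# Made-up data: A has 3 rows, B and C have 2 each, D has 1
df_ties <- data.frame(
  Brand  = c("A", "A", "A", "B", "B", "C", "C", "D"),
  Clicks = 1:8,
  stringsAsFactors = FALSE
)

kept <- df_ties %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3) %>%  # ranks: A = 1, B = C = 2, D = 3
  distinct(Brand) %>%
  pull(Brand)

kept  # three brands survive, because B and C tie for rank 2
```

Whether that is desirable depends on how ties should be treated; if exactly two brands are required, the ties need to be broken explicitly.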


聊天终结者

#6 · 2020-08-18 06:17

Another dplyr solution using a join to filter the data frame:

library(dplyr)

df %>%
  group_by(Brand) %>%
  summarise(n = n()) %>%
  top_n(2, n) %>% # keep the 2 Brands with the most rows
  left_join(df, by = "Brand") %>% # join back, keeping only rows for those Brands
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# # A tibble: 4 x 3
# # Groups:   Brand [?]
#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
# 1 A            1        12  
# 2 A            2        14  
# 3 B            1        15  
# 4 B            2        11.5
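
Because the join here is only used to restrict the rows, a filtering join is an alternative worth considering. A minimal sketch, assuming dplyr >= 1.0.0 (for slice_max and .groups); the names top_brands and brand_n are illustrative. semi_join is also among the verbs that dbplyr translates to SQL, so this shape may suit the database requirement in the question, although that is untested here.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Brand    = c("A", "B", "C", "A", "A", "B", "A", "A", "B", "C"),
  Category = c(1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  Clicks   = c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
)

# One row per Brand, restricted to the two brands with the most rows
top_brands <- df %>%
  count(Brand, name = "brand_n") %>%
  slice_max(brand_n, n = 2, with_ties = FALSE)

# semi_join keeps the rows of df that have a match in top_brands
# and adds no columns from top_brands
result <- df %>%
  semi_join(top_brands, by = "Brand") %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")
result
```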


太酷不给撩

#7 · 2020-08-18 06:21

EDIT

Based on the updated question, we can first add a count column, keep only the rows whose count is among the top n group counts, and then group_by Brand and Category to find the mean for each group.

df %>%
  add_count(Brand, sort = TRUE) %>%
  filter(n %in% head(unique(n), 2)) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))


#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
#1 A            1        12  
#2 A            2        14  
#3 B            1        15  
#4 B            2        11.5

Original Answer

We can group_by Brand, do all the calculations per group, and then keep the top groups with top_n:

library(dplyr)
df %>%
  group_by(Brand) %>%
  summarise(n = n(), 
            mean = mean(Clicks)) %>%
  top_n(2, n) %>%
  select(-n)

#  Brand  mean
#  <fct> <dbl>
#1  A      12.8
#2  B      12.7
