I want to keep only the n largest groups based on count, and then do some calculations on the filtered data frame.
Here is some data:
Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)
|Brand | Category| Clicks|
|:-----|--------:|------:|
|A | 1| 10|
|B | 2| 11|
|C | 1| 12|
|A | 1| 13|
|A | 2| 14|
|B | 1| 15|
|A | 2| 14|
|A | 1| 13|
|B | 2| 12|
|C | 1| 11|
This is my expected output. I want to keep only the two largest brands by count (A and B here) and then find the mean clicks for each brand/category combination.
|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A | 1| 12.0|
|A | 2| 14.0|
|B | 1| 15.0|
|B | 2| 11.5|
I thought this could be achieved with code like the following, but it doesn't work:
df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))
EDIT: The ideal answer should work on database tables as well as local tables.
Slightly different from the above, just because I don't like to use joins with large datasets. Some people might not like that I create and then remove a small data frame, sorry :(
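As a minimal sketch of the join-free idea described above (not necessarily the code originally posted): build a small helper data frame of the two most frequent brands, filter with `%in%`, then remove the helper. The name `top_brands` is a placeholder chosen for illustration, and `dplyr` 1.0 or later is assumed for the `.groups` argument.

```r
library(dplyr)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

# Small helper data frame: brands ordered by row count, top two kept.
top_brands <- df %>%
  count(Brand, sort = TRUE) %>%
  head(2)

# Filter without a join, then summarise per Brand/Category.
res <- df %>%
  filter(Brand %in% top_brands$Brand) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

# Remove the helper once it is no longer needed.
rm(top_brands)

res
```

Note that `head(2)` breaks ties arbitrarily; if tied brands should all be kept, a rank-based filter would be needed instead.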
How about this approach, using `table` from base R?

A `data.table` idea is to get the counts grouped by `Brand` and keep the top two (after ordering in descending order). Then we merge with the original data frame and find the mean grouped by (`Brand`, `Category`), which gives the expected output shown in the question.
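The `data.table` steps described above (count per brand, order descending, keep two, merge back, average per group) could be sketched roughly as follows. This is an illustration under the assumption that the `data.table` package is installed; the names `dt`, `top`, and `merged` are placeholders, not taken from the original answer.

```r
library(data.table)

dt <- data.table(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

# Count rows per brand, order by count (descending), keep the top two.
top <- head(dt[, .N, by = Brand][order(-N)], 2)

# Merge back onto the original data to keep only rows of those brands.
merged <- merge(dt, top[, .(Brand)], by = "Brand")

# Mean clicks per Brand/Category; keyby also sorts the result by the keys.
res <- merged[, .(mean_clicks = mean(Clicks)), keyby = .(Brand, Category)]

res
```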
A different `dplyr` solution:

Or a simplified version (based on suggestions from @camille):
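One compact way to write this as a single `dplyr` pipeline, offered only as a sketch and not necessarily what was posted here, is to compute the top brands inline with base R's `table` inside `filter`. This assumes a local data frame; it is unlikely to translate to a database backend, since `table` is evaluated in R.

```r
library(dplyr)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

res <- df %>%
  # table(Brand) counts rows per brand; sort descending and take two names.
  filter(Brand %in% head(names(sort(table(Brand), decreasing = TRUE)), 2)) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

res
```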
Another `dplyr` solution, using a join to filter the data frame:

EDIT

Based on the updated question, we can add a count column first, keep only the top `n` group counts, and then `group_by` `Brand` and `Category` to find the `mean` for each group.

Original Answer

We can `group_by` `Brand`, do all the calculations by group, and then filter the top groups with `top_n`.
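The three `dplyr` ideas above (a join-based filter, a count column, and the original per-brand `top_n`) could be sketched as follows. These are illustrations rather than the code originally posted; the names `top_brands`, `brand_n`, and the `res_*` objects are placeholders, and `dplyr` 1.0 or later is assumed.

```r
library(dplyr)

df <- data.frame(
  Brand    = c("A","B","C","A","A","B","A","A","B","C"),
  Category = c(1,2,1,1,2,1,2,1,2,1),
  Clicks   = c(10,11,12,13,14,15,14,13,12,11)
)

# Variant 1: join-based filter.
# Count rows per brand, keep the two largest, and semi-join back.
top_brands <- df %>%
  count(Brand, sort = TRUE) %>%
  head(2)

res_join <- df %>%
  semi_join(top_brands, by = "Brand") %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

# Variant 2: add a count column, keep the top two distinct counts,
# then summarise per Brand/Category. dense_rank keeps tied brands.
res_count <- df %>%
  add_count(Brand, name = "brand_n") %>%
  filter(dense_rank(desc(brand_n)) <= 2) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks), .groups = "drop")

# Variant 3: summarise per brand, then keep the two largest brands.
# This yields one row per brand rather than per Brand/Category.
res_brand <- df %>%
  group_by(Brand) %>%
  summarise(n = n(), mean_clicks = mean(Clicks)) %>%
  top_n(2, n)
```

Variants 1 and 2 avoid base R helpers and use only `dplyr` verbs, so they may be better candidates for database tables through `dbplyr`, though that has not been verified here.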