I have a table with two columns namely id
and item
:
df <- data.frame(id=c(1,1,2,2,2,2,3,3,3,4,4,4,4,4),item=c(1,2,3,1,2,3,4,1,2,3,1,2,1,2))
I want to find the most frequent combination (order doesn't matter) of 3 items per id
. So basically, n
choose r
where n = number of items within id
and r = 3
. The number of items per id
varies - some have more than 3, some have less.
I am new to R and read about combn
and expand.grid
, but I don't know how to use them in my case (to work within each id
).
"Find most frequent combination of values in a data.frame" is the closest question I found.
EDIT: The expected answer based on the example is the combination "1, 2, 3", which appears in id 2 and 4.
Here is one idea using dplyr
df1 <- df %>%
group_by(id) %>%
arrange(item) %>%
summarise(new = paste(unique(combn(item, length(unique(item)), toString)), collapse = '/'))
df1
# A tibble: 4 × 2
# id new
# <dbl> <chr>
#1 1 1, 2
#2 2 1, 2, 3 / 1, 3, 3 / 2, 3, 3
#3 3 1, 2, 4
#4 4 1, 1, 2 / 1, 1, 3 / 1, 2, 2 / 1, 2, 3 / 2, 2, 3
names(sort(table(unlist(strsplit(df1$new, '/'))), decreasing = TRUE)[1])
#[1] "1, 2, 3"
library(dplyr)
grouped <- df %>% group_by(id,item) %>% summarize(count = n()) %>% arrange(desc(count))
Voila. The highest counts sorted from highest to lowest.
EDIT: Just realized I didn't fully answer your question. I hope I gave you a good start.
I think this is what you want using base R (no package needed):
a <- aggregate(item~id, df, unique)
a <- lapply(a$item, 'length<-', max(lengths(a$item)))
m <- matrix(unlist(a), ncol=3, byrow = T)
m <- t(apply(m,1,function(x) sort(x,na.last = T)))
# [,1] [,2] [,3]
#[1,] 1 2 NA
#[2,] 1 2 3
#[3,] 1 2 4
#[4,] 1 2 3
Once we get matrix m
, the most frequent row of the matrix is what you want:
t <- table(apply(m, 1, paste, collapse = "/"))
as.numeric(strsplit(names(which.max(t)), "/")[[1]])
#[1] 1 2 3