I have a data frame with a list column holding character vectors of variable length (the variable length is important, because otherwise this question would be a duplicate of this or this), e.g.:
```r
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df
```

```
  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E
```
The desired form is a dummy variable for each unique string seen anywhere in `df$y`, i.e.:
```r
data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0),
           C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))
```

```
  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1
```
This naive approach works:
```r
uniqueStrings <- unique(unlist(df$y))
n <- ncol(df)
for (i in seq_along(uniqueStrings)) {
  df[, n + i] <- sapply(df$y, function(s) as.integer(uniqueStrings[i] %in% s))
  colnames(df)[n + i] <- uniqueStrings[i]
}
```
However, it is ugly and slow on big data frames. Any suggestions? Something fancy from the tidyverse?
UPDATE: I got 3 different approaches below. I tested them using `system.time` on my (Windows 7, 32 GB RAM) laptop, on a real dataset of 1M rows, each row holding a list of 1 to 4 strings (out of ~350 unique string values), about 200 MB on disk. So the expected result is a data frame of dimensions 1M x 350. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long I had to restart R. The qdapTools (@akrun) approach, however, worked fantastically:
```
> system.time(res1 <- mtabulate(varsLists))
   user  system elapsed
  47.05   10.27  116.82
```
So this is the approach I'll mark accepted.
Another idea, using the tidyverse:
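A minimal sketch of that route (assuming tidyr's `unnest` and `spread`; in current tidyr, `pivot_wider` replaces `spread`):

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

df %>%
  unnest(y) %>%            # one row per (x, string) pair
  mutate(value = 1) %>%    # marker that becomes the dummy value
  spread(y, value, fill = 0)
```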
Further to the cases you mentioned in the comments, we can use `dcast` from reshape2, as it is more flexible than `spread`.
Alternatively, this involves no external packages:
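One base-R sketch tabulates each string against its row id with `table`, then binds the result back to `x`:

```r
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# contingency table: rows are x values, columns are the unique strings
tab <- table(rep(df$x, lengths(df$y)), unlist(df$y))
cbind(df["x"], as.data.frame.matrix(tab))
```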
We can use `mtabulate` from qdapTools:
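A sketch of that approach; `mtabulate` tabulates each list element directly, so the list column can be passed as-is:

```r
library(qdapTools)

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# one dummy column per unique string, one row per list element
cbind(df["x"], mtabulate(df$y))
```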