Suppose I have data that has the following format:
ID VALUE
a a
a b
d b
d c
What I would like to do is a one hot-encoding for the ID value. When I use model.matrix
, I obtain:
model.matrix(~VALUE-1, df)
ID aVALUE bVALUE cVALUE
a 1 0 0
a 0 1 0
d 0 1 0
d 0 0 1
What I would like to get however is this:
ID aVALUE bVALUE cVALUE
a 1 1 0
d 0 1 1
The other part to this is that my data frame is approximately 30 million rows - so I am looking for an efficient way to do this. Any help or comments would be greatly appreciated!
Thank you!
You could use table
.
d <- table(df$ID, df$VALUE)
# a b c
# a 1 2 0
# d 0 1 1
If you have to enforce values of 1 or 0 because some combinations show up more than once, then you can convert those cases to 1:
d[d > 1L] <- 1
# a b c
# a 1 1 0
# d 0 1 1
Example data
df <- structure(list(ID = c("a", "a", "a", "d", "d"), VALUE = c("a", "b", "b", "b", "c")),
.Names = c("ID", "VALUE"), class = "data.frame", row.names = c(NA, -5L))
One option is dcast
from data.table
to convert to 'wide' format from 'long'. Convert the 'data.frame' to 'data.table' (setDT(df)
), reshape it to 'wide'format with dcast
and specify the fun.aggregate
. For big datasets, the dcast
approach would be fast.
library(data.table)
dcast(setDT(df), ID~paste0(VALUE, "VALUE"), value.var = "VALUE",
function(x) as.integer(length(x) > 0))
# ID aVALUE bVALUE cVALUE
#1: a 1 1 0
#2: d 0 1 1
Another option is dplyr/tidyr
library(dplyr)
library(tidyr)
df %>%
unique() %>%
mutate(n = 1)%>%
spread(VALUE, n, fill = 0)