One Hot Encoding From Multiple Rows in R

2019-07-22 13:07发布

问题:

Suppose I have data that has the following format:

ID VALUE
a  a
a  b
d  b
d  c

What I would like to do is a one hot-encoding for the ID value. When I use model.matrix, I obtain:

model.matrix(~VALUE-1, df)

ID aVALUE bVALUE cVALUE
a  1      0      0
a  0      1      0
d  0      1      0
d  0      0      1

What I would like to get however is this:

ID aVALUE bVALUE cVALUE
a  1      1      0
d  0      1      1

The other part to this is that my data frame is approximately 30 million rows - so I am looking for an efficient way to do this. Any help or comments would be greatly appreciated!

Thank you!

回答1:

You could use table.

d <- table(df$ID, df$VALUE)
#    a b c
#  a 1 2 0
#  d 0 1 1

If you have to enforce values of 1 or 0 because some combinations show up more than once, then you can convert those cases to 1:

d[d > 1L] <- 1
#    a b c
#  a 1 1 0
#  d 0 1 1

Example data

df <- structure(list(ID = c("a", "a", "a", "d", "d"), VALUE = c("a", "b", "b", "b", "c")),
   .Names = c("ID", "VALUE"), class = "data.frame", row.names = c(NA, -5L))


回答2:

One option is dcast from data.table to convert to 'wide' format from 'long'. Convert the 'data.frame' to 'data.table' (setDT(df)), reshape it to 'wide'format with dcast and specify the fun.aggregate. For big datasets, the dcast approach would be fast.

library(data.table)
dcast(setDT(df), ID~paste0(VALUE, "VALUE"), value.var = "VALUE", 
                function(x) as.integer(length(x) > 0))
#    ID aVALUE bVALUE cVALUE
#1:  a      1      1      0
#2:  d      0      1      1

Another option is dplyr/tidyr

library(dplyr)
library(tidyr)
df %>% 
   unique() %>%
   mutate(n = 1)%>% 
   spread(VALUE, n, fill = 0)