One Hot Encoding From Multiple Rows in R

2019-07-22 12:58发布

Suppose I have data that has the following format:

ID VALUE
a  a
a  b
d  b
d  c

What I would like to do is a one hot-encoding for the ID value. When I use model.matrix, I obtain:

model.matrix(~VALUE-1, df)

ID aVALUE bVALUE cVALUE
a  1      0      0
a  0      1      0
d  0      1      0
d  0      0      1

What I would like to get however is this:

ID aVALUE bVALUE cVALUE
a  1      1      0
d  0      1      1

The other part to this is that my data frame is approximately 30 million rows - so I am looking for an efficient way to do this. Any help or comments would be greatly appreciated!

Thank you!

2条回答
太酷不给撩
2楼-- · 2019-07-22 13:42

One option is dcast from data.table to convert to 'wide' format from 'long'. Convert the 'data.frame' to 'data.table' (setDT(df)), reshape it to 'wide'format with dcast and specify the fun.aggregate. For big datasets, the dcast approach would be fast.

library(data.table)
dcast(setDT(df), ID~paste0(VALUE, "VALUE"), value.var = "VALUE", 
                function(x) as.integer(length(x) > 0))
#    ID aVALUE bVALUE cVALUE
#1:  a      1      1      0
#2:  d      0      1      1

Another option is dplyr/tidyr

library(dplyr)
library(tidyr)
df %>% 
   unique() %>%
   mutate(n = 1)%>% 
   spread(VALUE, n, fill = 0)
查看更多
一纸荒年 Trace。
3楼-- · 2019-07-22 13:52

You could use table.

d <- table(df$ID, df$VALUE)
#    a b c
#  a 1 2 0
#  d 0 1 1

If you have to enforce values of 1 or 0 because some combinations show up more than once, then you can convert those cases to 1:

d[d > 1L] <- 1
#    a b c
#  a 1 1 0
#  d 0 1 1

Example data

df <- structure(list(ID = c("a", "a", "a", "d", "d"), VALUE = c("a", "b", "b", "b", "c")),
   .Names = c("ID", "VALUE"), class = "data.frame", row.names = c(NA, -5L))
查看更多
登录 后发表回答