可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Suppose I have data that has the following format:

ID VALUE
a  a
a  b
d  b
d  c

What I would like to do is a one hot-encoding for the ID value. When I use model.matrix, I obtain:

model.matrix(~VALUE-1, df)

ID aVALUE bVALUE cVALUE
a  1      0      0
a  0      1      0
d  0      1      0
d  0      0      1

What I would like to get however is this:

ID aVALUE bVALUE cVALUE
a  1      1      0
d  0      1      1

The other part to this is that my data frame is approximately 30 million rows - so I am looking for an efficient way to do this. Any help or comments would be greatly appreciated!

Thank you!

回答1:

You could use table.

d <- table(df$ID, df$VALUE)
#    a b c
#  a 1 2 0
#  d 0 1 1

If you have to enforce values of 1 or 0 because some combinations show up more than once, then you can convert those cases to 1:

d[d > 1L] <- 1
#    a b c
#  a 1 1 0
#  d 0 1 1

Example data

df <- structure(list(ID = c("a", "a", "a", "d", "d"), VALUE = c("a", "b", "b", "b", "c")),
   .Names = c("ID", "VALUE"), class = "data.frame", row.names = c(NA, -5L))

回答2:

One option is dcast from data.table to convert to 'wide' format from 'long'. Convert the 'data.frame' to 'data.table' (setDT(df)), reshape it to 'wide'format with dcast and specify the fun.aggregate. For big datasets, the dcast approach would be fast.

library(data.table)
dcast(setDT(df), ID~paste0(VALUE, "VALUE"), value.var = "VALUE", 
                function(x) as.integer(length(x) > 0))
#    ID aVALUE bVALUE cVALUE
#1:  a      1      1      0
#2:  d      0      1      1

Another option is dplyr/tidyr

library(dplyr)
library(tidyr)
df %>% 
   unique() %>%
   mutate(n = 1)%>% 
   spread(VALUE, n, fill = 0)

One Hot Encoding From Multiple Rows in R

问题:

回答1:

回答2:

收藏的人(0)

One Hot Encoding From Multiple Rows in R

问题:

回答1:

回答2:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮