Suppose I have data that has the following format:
ID VALUE
a a
a b
d b
d c
What I would like to do is a one hot-encoding for the ID value. When I use model.matrix
, I obtain:
model.matrix(~VALUE-1, df)
ID aVALUE bVALUE cVALUE
a 1 0 0
a 0 1 0
d 0 1 0
d 0 0 1
What I would like to get however is this:
ID aVALUE bVALUE cVALUE
a 1 1 0
d 0 1 1
The other part to this is that my data frame is approximately 30 million rows - so I am looking for an efficient way to do this. Any help or comments would be greatly appreciated!
Thank you!
One option is
dcast
fromdata.table
to convert to 'wide' format from 'long'. Convert the 'data.frame' to 'data.table' (setDT(df)
), reshape it to 'wide'format withdcast
and specify thefun.aggregate
. For big datasets, thedcast
approach would be fast.Another option is
dplyr/tidyr
You could use
table
.If you have to enforce values of 1 or 0 because some combinations show up more than once, then you can convert those cases to 1:
Example data