I currently have a data frame with 260,000 rows and 50 columns, where 3 columns are numeric and the rest are categorical. I want to one-hot encode the categorical columns in order to perform PCA and use regression to predict the class. How can I accomplish the example below in R?
Example:
V1 V2 V3 V4 V5 .... VN-1 VN
to
V1_a V1_b V2_a V2_b V2_c V3_a V3_b and so on
Basically a one-liner with `data.table` and `mltools`.
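A minimal sketch of what such a one-liner might look like, assuming the `mltools` function meant here is `one_hot()`. The toy data frame `df` and its column names are hypothetical and only mirror the V1/V2 layout from the question.

```r
library(data.table)
library(mltools)

# Hypothetical toy data: one numeric column and two categorical columns.
df <- data.frame(
  x  = c(1.5, 2.0, 3.5, 4.0),
  V1 = factor(c("a", "b", "a", "b")),
  V2 = factor(c("a", "b", "c", "a"))
)

# one_hot() works on a data.table and, by default, encodes every
# unordered factor column. Character columns are left untouched,
# so convert them to factors first if they should be encoded.
encoded <- one_hot(as.data.table(df))

# Inspect the resulting column names (one indicator column per factor level).
names(encoded)
```

Numeric columns such as `x` pass through unchanged, so the result can be handed to PCA or a regression routine after any scaling you need.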
You can use `model.matrix` or `sparse.model.matrix`. Something like this: `sparse.model.matrix(~. -1, data = your_data)`
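A small sketch of that call on a hypothetical toy data frame (`toy` and its columns are made up for illustration). One thing worth knowing: with the default treatment contrasts, dropping the intercept gives a full set of indicator columns only for the first factor; later factors still lose their reference level. The last call shows one common way to get an indicator for every level, by passing identity contrasts through `contrasts.arg`.

```r
library(Matrix)  # provides sparse.model.matrix()

# Hypothetical toy data: one numeric column and two factors.
toy <- data.frame(
  num = c(0.5, 1.2, 3.3, 4.1),
  V1  = factor(c("a", "b", "a", "b")),
  V2  = factor(c("a", "b", "c", "a"))
)

# Dense and sparse versions of the same design matrix, without an intercept.
dense  <- model.matrix(~ . - 1, data = toy)
sparse <- sparse.model.matrix(~ . - 1, data = toy)

# Full one-hot encoding: use identity ("no contrast") coding for every factor,
# so no level is dropped as a reference category.
factor_cols <- toy[sapply(toy, is.factor)]
full <- model.matrix(
  ~ . - 1,
  data = toy,
  contrasts.arg = lapply(factor_cols, contrasts, contrasts = FALSE)
)

colnames(dense)
colnames(full)
```

For a 260,000-row table with many categorical columns, the sparse variant is likely the more memory-friendly choice, though some PCA routines expect a dense matrix.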
The `~.` tells R that your entire table (the `.`) is the right-hand side of some hypothetical model, and the `-1` says to leave out the intercept. Without the `-1`, your first column will be a vector of 1s.
I don't really know what you mean by "hot encode".
Here's an example of using dplyr to spread out the categorical variable `iris$Species` into three separate columns:
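A minimal sketch of that idea, assuming `tidyr` is available alongside `dplyr` for the reshaping step. It uses `pivot_wider()`, which is a substitution on my part for the older `spread()`; the helper columns `row_id` and `value` are hypothetical names introduced only for this example.

```r
library(dplyr)
library(tidyr)

iris_wide <- iris %>%
  # A row identifier keeps duplicated measurement rows distinct while reshaping;
  # `value` is the indicator that lands in the matching species column.
  mutate(row_id = row_number(), value = 1L) %>%
  # One new column per Species level; rows of other species are filled with 0.
  pivot_wider(
    names_from  = Species,
    values_from = value,
    values_fill = list(value = 0L)
  ) %>%
  select(-row_id)

head(iris_wide)
```

Each row ends up with a 1 in exactly one of the `setosa`, `versicolor`, and `virginica` columns and 0 in the other two.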