I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns.
The structure of the data is:
mydata: data.table of 5,800,000 rows and two categorical (in integer format) variables Var1 and Var2
nlevel(Var1) : 210,000 (levels include all numbers between 1 and 210,000)
nlevel(Var2) : 500 (levels include all numbers between 1 and 500)
here's an example of mydata:
Var_1 Var_2
1 4
1 2
2 7
5 9
5 500
.
.
.
200 6
200 2
200 80
.
.
.
I'm using a sparse Matrix (sparse_Mx) to create the dummy variable matrix which would be of the form:
Var_1 Var_2_level_1 Var_2_level_2 . . . Var_2_level_500
1 0 1 0
2 0 0 0
3 1 1 0
4 0 0 0
5 0 0 1
.
.
.
200 0 1 0
.
.
.
210,000 ... ... ...
I didn't know how to do this efficiently, so i used a for-loop to create the dummy variable matrix:
library(Matrix) #for sparse matrices
m2 <- Matrix(0, nrow = 210000, ncol = 500 , sparse = TRUE)
for (i in 1: nrow(mydata))
sparse_Mx[ mydata[i, Var_1] , mydata[i, Var_2] ] <- 1
It basically goes through each row of mydata, and based on the row Var1 value (which determines the row in the matrix) and the row Var2 value (which determines the column number in the matrix, fills the sparse matrix with 1.
It works, except it's taking forever (as the for-loop has to go through 5,800,000 loops!)
Is there any way to do this more efficiently? I really dislike using for-loop for this purpose but couldn't think of another way to do this.
Edit: I'd like to add that I have tried using sparse.model.matrix(), to no avail. the generated matrix is not in the right format (210,000 rows and 500 columns).
The variables were converted to factors and used the following:
sp_mx <- sparse.model.matrix( ~ . -1 , data = mydata)
However, I get a sparse matrix of [5,800,000 x 500 ] as opposed to a matrix of [210,000 x 500]
I've tried many variations and still same result:
sp_mx <- sparse.model.matrix( ~ Var2 -1 , data = mydata)
or
sp_mx <- sparse.model.matrix(Var1 ~ Var2 -1 , data = mydata)
all of them result in a sparse matrix with all rows. what i need is a [210,000 x 500] matrix that will have more than one 1 in each row.