I'm trying to create a model matrix in sparklyr. There is a function, ml_create_dummy_variables(), for creating dummy variables for one categorical variable at a time, but as far as I can tell there is no model.matrix() equivalent for building the whole model matrix in one step. ml_create_dummy_variables() is easy to use, but I don't understand why the new dummy variables aren't stored in the Spark DataFrame.
Consider this example:
###create dummy data to figure out how model matrix formulas work in sparklyr
v1 <- sample(LETTERS[1:4], 50000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
v2 <- sample(LETTERS[5:6], 50000, replace = TRUE, prob = c(0.7, 0.3))
v3 <- sample(LETTERS[7:10], 50000, replace = TRUE, prob = c(0.3, 0.2, 0.4, 0.1))
v4 <- sample(LETTERS[11:15], 50000, replace = TRUE, prob = c(0.1, 0.1, 0.3, 0.05, 0.45))
v5 <- sample(LETTERS[16:17], 50000, replace = TRUE, prob = c(0.4, 0.6))
v6 <- sample(LETTERS[18:21], 50000, replace = TRUE, prob = c(0.1, 0.1, 0.65, 0.15))
v7 <- sample(LETTERS[22:26], 50000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.03, 0.02))
v8 <- rnorm(n = 50000, mean = 0.5, sd = 0.1)
v9 <- rnorm(n = 50000, mean = 5, sd = 3)
v10 <- rnorm(n = 50000, mean = 3, sd = 0.5)
response <- rnorm(n = 50000, mean = 10, sd = 2)
dat <- data.frame(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, response)
write.csv(dat, file = 'fake_dat.csv', row.names = FALSE)
#push "fake_dat.csv" to HDFS
library(dplyr)
library(sparklyr)
#configure the spark session and connect
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G" #change depending on the size of the data
config$`sparklyr.shell.executor-memory` <- "2G"
sc <- spark_connect(master='yarn-client', spark_home='/usr/hdp/2.5.0.0-1245/spark',config = config)
sc
#can also set spark_home as '/usr/hdp/current/spark-client'
#read in the data from HDFS
df <- spark_read_csv(sc,name='fdat',path='hdfs://pnhadoop/user/stc004/fake_dat.csv')
#create spark table
dat <- tbl(sc,'fdat')
#create dummy variables
ml_create_dummy_variables(x=dat,'v1', reference = NULL)
Now I get the following output from sparklyr:
Source: query [5e+04 x 15]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 A F I O Q T X 0.4518162 12.281566 3.915094
2 C E H L Q T X 0.3967605 2.131341 3.373347
3 C F I O P S W 0.4458047 7.167670 2.737003
4 C E G M P T X 0.4822457 5.946978 2.375309
5 B E H L P U W 0.4756011 9.456327 2.406996
6 C F H L P U X 0.5064916 2.920591 3.111827
7 C F I O Q T W 0.3060585 1.611517 2.242328
8 B F J L Q T V 0.6238052 9.821750 2.670400
9 C E I O Q U X 0.4249922 2.141794 3.020958
10 B F G K P T X 0.5348334 1.461034 3.057635
# ... with 4.999e+04 more rows, and 5 more variables: response <dbl>,
# v1_A <dbl>, v1_B <dbl>, v1_C <dbl>, v1_D <dbl>
But when I check the column names, the new dummy variables don't appear:
> colnames(dat)
[1] "v1" "v2" "v3" "v4" "v5" "v6"
[7] "v7" "v8" "v9" "v10" "response"
>
Why is that happening? Also, is there an easy way to convert all of the categorical columns in one step? I work with datasets of more than 1,000 variables, so I need a quick way to do this. I've tried a loop, but it doesn't do anything:
for(i in 1:7){
  ml_create_dummy_variables(x = dat, colnames(dat)[i], reference = NULL)
}
ml_create_dummy_variables() doesn't modify the existing table; it creates a new one, and your code simply discards the result. You have to store the result, e.g. by assigning it back to dat. A loop or Reduce works fine for that (see the sketch below), but there is no quick way to do it: to create the dummies, all possible levels have to be determined first, and this requires a full column scan for each variable. Furthermore, with more than 1,000 columns, especially with large numbers of levels, you start to hit various limitations of the Spark optimizer.
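A minimal sketch of both approaches, assuming the categorical columns are v1 through v7 as in the example data above:

# loop: overwrite `dat` with the result of each call
cat_cols <- paste0("v", 1:7)
for (col in cat_cols) {
  dat <- ml_create_dummy_variables(x = dat, col, reference = NULL)
}

# equivalently, fold over the columns with Reduce (use one approach or
# the other, not both, or each variable gets encoded twice)
dat <- Reduce(
  function(df, col) ml_create_dummy_variables(x = df, col, reference = NULL),
  cat_cols,
  dat
)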
sparklyr (unlike Spark ML, which uses VectorUDT) expands all the columns, and this doesn't scale well.
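If the end goal is fitting a Spark ML model rather than inspecting the dummy columns, one way to stay on the VectorUDT path is Spark's RFormula transformer, exposed as ft_r_formula() in later sparklyr versions (0.7 and up). A sketch under that assumption, using the example data above; the full scan to determine the levels still happens, but the encoded features are kept in a single vector column instead of being expanded:

# RFormula string-indexes and one-hot encodes the character columns and
# assembles them, together with the numeric columns, into a single
# 'features' vector column, with 'response' as the label
# (requires sparklyr >= 0.7)
model_tbl <- dat %>%
  ft_r_formula(response ~ .)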