Why doesn't ml_create_dummy_variables show new

Posted 2019-05-23 12:58

Question:

I'm trying to create a model matrix in sparklyr. There is a function ml_create_dummy_variables() for creating dummy variables for one categorical variable at a time. As far as I can tell there is no model.matrix() equivalent for creating a model matrix in one step. It's easy to use ml_create_dummy_variables() but I don't understand why the new dummy variables aren't stored in the Spark dataframe.

Consider this example:

    ###create dummy data to figure out how model matrix formulas work in sparklyr

v1 <- sample( LETTERS[1:4], 50000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
v2 <- sample( LETTERS[5:6], 50000, replace=TRUE, prob=c(0.7,0.3))
v3 <- sample( LETTERS[7:10], 50000, replace=TRUE, prob=c(0.3, 0.2, 0.4, 0.1))
v4 <- sample( LETTERS[11:15], 50000, replace=TRUE, prob=c(0.1, 0.1, 0.3, 0.05,.45))
v5 <- sample( LETTERS[16:17], 50000, replace=TRUE, prob=c(0.4,0.6))
v6 <- sample( LETTERS[18:21], 50000, replace=TRUE, prob=c(0.1, 0.1, 0.65, 0.15))
v7 <- sample( LETTERS[22:26], 50000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.03,.02))
v8 <- rnorm(n=50000,mean=.5,sd=.1)
v9 <- rnorm(n=50000,mean=5,sd=3)
v10 <- rnorm(n=50000,mean=3,sd=.5)
response <- rnorm(n=50000,mean=10,sd=2)

dat <- data.frame(v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,response)
write.csv(dat,file='fake_dat.csv',row.names = FALSE)

#push "fake_dat.csv" to the hdfs

library(dplyr)
library(sparklyr)
#configure the spark session and connect
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G" #change depending on the size of the data
config$`sparklyr.shell.executor-memory` <- "2G"

sc <-  spark_connect(master='yarn-client', spark_home='/usr/hdp/2.5.0.0-1245/spark',config = config)
sc

#can also set spark_home as '/usr/hdp/current/spark-client'

#read in the data from the hdfs
df <- spark_read_csv(sc,name='fdat',path='hdfs://pnhadoop/user/stc004/fake_dat.csv')

#create spark table
dat <- tbl(sc,'fdat')

#create dummy variables
ml_create_dummy_variables(x=dat,'v1', reference = NULL)

Now I get the following output from sparklyr:

Source:   query [5e+04 x 15]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

      v1    v2    v3    v4    v5    v6    v7        v8        v9      v10
   <chr> <chr> <chr> <chr> <chr> <chr> <chr>     <dbl>     <dbl>    <dbl>
1      A     F     I     O     Q     T     X 0.4518162 12.281566 3.915094
2      C     E     H     L     Q     T     X 0.3967605  2.131341 3.373347
3      C     F     I     O     P     S     W 0.4458047  7.167670 2.737003
4      C     E     G     M     P     T     X 0.4822457  5.946978 2.375309
5      B     E     H     L     P     U     W 0.4756011  9.456327 2.406996
6      C     F     H     L     P     U     X 0.5064916  2.920591 3.111827
7      C     F     I     O     Q     T     W 0.3060585  1.611517 2.242328
8      B     F     J     L     Q     T     V 0.6238052  9.821750 2.670400
9      C     E     I     O     Q     U     X 0.4249922  2.141794 3.020958
10     B     F     G     K     P     T     X 0.5348334  1.461034 3.057635
# ... with 4.999e+04 more rows, and 5 more variables: response <dbl>,
#   v1_A <dbl>, v1_B <dbl>, v1_C <dbl>, v1_D <dbl>

When I check the column names, the new dummy variables don't appear.

> colnames(dat)
 [1] "v1"       "v2"       "v3"       "v4"       "v5"       "v6"
 [7] "v7"       "v8"       "v9"       "v10"      "response"
>

Why is that happening? Also, is there an easy way to convert all columns in one step? I work with datasets of >1000 variables so I need a quick way to do this. I've tried creating a loop, but that doesn't do anything:

for(i in 1:7){
ml_create_dummy_variables(x=dat,colnames(dat)[i],reference=NULL)
}

Answer 1:

ml_create_dummy_variables doesn't modify the existing table; it returns a new one, and your code simply discards the result. You have to store it:

tmp <- ml_create_dummy_variables(x=dat,'v1', reference = NULL)
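The same behaviour can be reproduced without a cluster. The sketch below is only an illustration on an ordinary data.frame: make_dummies and local_dat are hypothetical names, and make_dummies is a local stand-in, not the sparklyr implementation. It shows that calling such a function without assignment leaves the original object unchanged.

```r
# Local stand-in for illustration only: builds 0/1 indicator columns on an
# ordinary data.frame. This is NOT how sparklyr implements
# ml_create_dummy_variables; it only mimics "return a new table".
make_dummies <- function(df, col) {
  for (lvl in sort(unique(df[[col]]))) {
    df[[paste(col, lvl, sep = "_")]] <- as.numeric(df[[col]] == lvl)
  }
  df  # a modified copy is returned; the caller's object is untouched
}

local_dat <- data.frame(v1 = c("A", "C", "B", "C"), stringsAsFactors = FALSE)

# Result is printed and then discarded, just like the call in the question.
make_dummies(local_dat, "v1")
cols_without_assignment <- colnames(local_dat)

# Result is kept, so the new columns are available afterwards.
local_dat <- make_dummies(local_dat, "v1")
cols_with_assignment <- colnames(local_dat)
```

With the Spark table from the question, colnames(tmp) should likewise include v1_A to v1_D, matching the extra variables listed at the bottom of the printed output.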

"Also, is there an easy way to convert all columns in one step? I work with datasets of >1000 variables so I need a quick way to do this."

A loop or Reduce works fine, as long as each result is carried forward, but there is no quick way to do it. To create dummies you first have to determine all possible levels, and this requires a full column scan for each variable.
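A minimal sketch of the Reduce approach follows. add_all_dummies and local_dummies are hypothetical helpers introduced here for illustration; local_dummies is a base-R stand-in so the example runs without a Spark connection, and the sparklyr call in the trailing comment reuses the argument form from the question and is untested.

```r
# Hypothetical helper (not part of sparklyr): thread a table through a
# dummy-creating function once per column, keeping each intermediate result
# instead of discarding it as the for loop in the question does.
add_all_dummies <- function(tbl, cols, dummy_fn) {
  Reduce(function(acc, col) dummy_fn(acc, col), cols, tbl)
}

# Local stand-in so the sketch runs on an ordinary data.frame.
local_dummies <- function(df, col) {
  for (lvl in sort(unique(df[[col]]))) {
    df[[paste(col, lvl, sep = "_")]] <- as.numeric(df[[col]] == lvl)
  }
  df
}

local_dat <- data.frame(
  v1 = c("A", "C", "B", "C"),
  v2 = c("E", "F", "E", "E"),
  stringsAsFactors = FALSE
)

wide <- add_all_dummies(local_dat, c("v1", "v2"), local_dummies)
colnames(wide)

# Assumed usage with the Spark table from the question (needs a live
# connection, not run here):
# dat_wide <- add_all_dummies(
#   dat, colnames(dat)[1:7],
#   function(tbl, col) ml_create_dummy_variables(x = tbl, col, reference = NULL)
# )
```

Each step still triggers its own pass to discover the levels of that column, so this fixes the discarded results but not the cost.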

Furthermore, with more than 1000 columns, especially ones with a large number of levels, you start to hit various limitations of the Spark optimizer. sparklyr (unlike Spark ML, which uses Vector UDT) expands all dummies into separate columns, and this doesn't scale well.