SparklyR: Convert directly to parquet

Posted 2019-07-14 23:13

Question:

Hi, I am new to cluster computing and am currently only playing around on a standalone cluster (sc <- spark_connect(master = "local", version = '2.0.2')). I have a massive CSV file (15 GB) that I would like to convert to a Parquet file (the third chunk of code explains why). This 15 GB file is already a sample of a 60 GB file, and I will need to use/query the full 60 GB file when I stop playing around. Currently, what I did was:

> system.time({FILE<-spark_read_csv(sc,"FILE",file.path("DATA/FILE.csv"),memory = FALSE)})
   user  system elapsed 
   0.16    0.04 1017.11 
> system.time({spark_write_parquet(FILE, file.path("DATA/FILE.parquet"),mode='overwrite')})
   user  system elapsed 
   0.92    1.48 1267.72 
> system.time({FILE<-spark_read_parquet(sc,"FILE", file.path("DATA/FILE.parquet"),memory = FALSE)})
   user  system elapsed 
   0.00    0.00    0.26 

As you can see, this takes quite a long time. I was wondering: what happens in the first line of code (spark_read_csv) with memory = FALSE? Where does it read/save the data to? And can I access that location when I disconnect and reconnect the session?

Also, is there a way to combine steps 1 and 2 in a more efficient way?

I am not shy about trying lower-level functions that aren't available in the API yet, provided the approach is simple and can be automated to a large degree.

Answer 1:

No data is saved when spark_read_csv is invoked with memory = FALSE. The delay you see is related not to loading the data as such, but to the schema inference process, which requires a separate scan of the data.

As convenient as schema inference is, performance-wise it is much better to provide the schema explicitly, as a named vector mapping column names to Spark's simple type strings. For example, if you were to load the iris dataset in local mode:

path <- tempfile()
readr::write_csv(iris, path)

you'd use

spark_read_csv(
  sc, "iris", path, infer_schema = FALSE, memory = FALSE,
  columns = c(
    Sepal_Length = "double", Sepal_Width = "double",
    Petal_Length = "double", Petal_Width = "double",
    Species = "string"))
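Applying the same idea to the original question, a minimal sketch of combining steps 1 and 2 might look like the following. It reads the CSV lazily (memory = FALSE) with an explicit schema, so there is no caching and no extra inference scan, and then writes Parquet in a single pass. The column names and types here are placeholders, since the actual schema of FILE.csv is not shown.

# A sketch, assuming hypothetical column names/types for FILE.csv --
# replace them with the real schema of your data.
FILE <- spark_read_csv(
  sc, "FILE", file.path("DATA/FILE.csv"),
  infer_schema = FALSE,   # skip the separate scan used for schema inference
  memory = FALSE,         # don't cache the data in memory
  columns = c(
    id = "integer",       # hypothetical column
    value = "double",     # hypothetical column
    label = "string"))    # hypothetical column

# Writing Parquet triggers the single full read of the CSV.
spark_write_parquet(FILE, file.path("DATA/FILE.parquet"), mode = "overwrite")

# Later sessions can read the Parquet output directly; it persists on disk.
FILE <- spark_read_parquet(
  sc, "FILE", file.path("DATA/FILE.parquet"), memory = FALSE)

With the schema given explicitly, the only full scan of the 15 GB file should happen during the Parquet write itself, and the resulting Parquet directory remains on disk, so it can be re-read after disconnecting and reconnecting.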