Hi, I am new to cluster computing and currently I am only playing around on a standalone cluster (sc <- spark_connect(master = "local", version = '2.0.2')). I have a massive CSV file (15 GB) which I would like to convert to a Parquet file (the third chunk of code explains why). This 15 GB file is already a sample of a 60 GB file, and I will need to use/query the full 60 GB file once I stop playing around. Currently what I did was:
> system.time({ FILE <- spark_read_csv(sc, "FILE", file.path("DATA/FILE.csv"), memory = FALSE) })
   user  system elapsed
   0.16    0.04 1017.11
> system.time({ spark_write_parquet(FILE, file.path("DATA/FILE.parquet"), mode = 'overwrite') })
   user  system elapsed
   0.92    1.48 1267.72
> system.time({ FILE <- spark_read_parquet(sc, "FILE", file.path("DATA/FILE.parquet"), memory = FALSE) })
   user  system elapsed
   0.00    0.00    0.26
As you can see, this takes quite a long time. I was wondering: what happens in the first line of code (spark_read_csv) with memory = FALSE? Where does it read/save the data to, and can I access that location when I disconnect and reconnect the session?
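To make that second part concrete, what I would like to be able to do after disconnecting is roughly the following ("FILE" is just the table name I registered above; I have no idea whether anything actually survives the session):

library(sparklyr)
sc <- spark_connect(master = "local", version = '2.0.2')
# Is the "FILE" table from the earlier spark_read_csv(..., memory = FALSE) still
# registered anywhere, or do I have to point spark_read_parquet at the path again?
FILE <- dplyr::tbl(sc, "FILE")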
Also, is there a way to combine steps 1 and 2 in a more efficient way?
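What I am imagining is something like piping the CSV reader straight into the Parquet writer, though I don't know whether Spark actually does any less work this way than in the two separate calls above:

library(sparklyr)
library(dplyr)   # for %>%

# read lazily (memory = FALSE) and immediately write out as Parquet
spark_read_csv(sc, "FILE", file.path("DATA/FILE.csv"), memory = FALSE) %>%
  spark_write_parquet(file.path("DATA/FILE.parquet"), mode = 'overwrite')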
I am not shy about trying lower-level functions that aren't available in the API yet, provided it is reasonably simple and can be automated to a large degree.
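For example (and this is only my guess at what "lower level" would look like), going through the JVM DataFrameReader/DataFrameWriter directly with sparklyr's invoke() interface instead of the spark_read_*/spark_write_* wrappers:

library(sparklyr)
library(dplyr)   # for %>%

# sketch only: chain Spark's own reader/writer objects via invoke(), so the CSV goes
# straight to Parquet without registering a temp table on the R side
spark_session(sc) %>%
  invoke("read") %>%
  invoke("option", "header", "true") %>%
  invoke("csv", file.path("DATA/FILE.csv")) %>%
  invoke("write") %>%
  invoke("mode", "overwrite") %>%
  invoke("parquet", file.path("DATA/FILE.parquet"))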