I have two scenarios over the same 23 GB of partitioned Parquet data: in each I read a few of the columns and cache them upfront, in order to fire a series of subsequent queries later on.
Setup:
- Cluster: 12-node EMR
- Spark Version: 1.6
- Spark Configurations: default (the assumed context setup is sketched below)
- Run Configurations: same for both cases
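For reference, a minimal sketch of the default Spark 1.6 context both cases assume; the app name is a placeholder and everything else is left at its defaults:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Default configuration; only the app name (a placeholder) is set explicitly.
val conf = new SparkConf().setAppName("parquet-cache-test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)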
Case 1:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)   // read the partitioned Parquet data
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count   // materialize the cache
From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.
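The cached size can also be read off programmatically rather than from the UI; a rough sketch (getRDDStorageInfo is a developer API and simply mirrors the Storage tab):

// Print the in-memory size of every cached RDD, mirroring the Storage tab.
sc.getRDDStorageInfo
  .filter(_.numCachedPartitions > 0)
  .foreach(info => println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory"))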
Case 2:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)   // same Parquet data as Case 1
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")   // only difference: order by pogId
dfMain.cache.count   // materialize the cache
From the Spark UI, the input data read is again 6.2 GB, but the cached object is only 5.5 GB.
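The only intended difference between the two cases is the order by, which should appear as a Sort in the physical plan; a rough sketch of how to confirm that, and that the in-memory columnar compression flag is still at its documented default (both calls are standard Spark 1.6 DataFrame/SQLContext APIs):

// Show parsed/analyzed/optimized/physical plans; Case 2's physical plan should contain a Sort.
dfMain.explain(true)

// Check the in-memory columnar compression flag (documented default is "true").
println(sqlContext.getConf("spark.sql.inMemoryColumnarStorage.compressed", "true"))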
Is there any explanation, or a pointer to the relevant code, for this behavior?