I have two scenarios over the same 23 GB of partitioned Parquet data: in each I read a few of the columns and cache them upfront, in order to fire a series of subsequent queries later on.
Setup:
- Cluster: 12-node EMR
- Spark Version: 1.6
- Spark Configurations: default (the assumed context setup is sketched below)
- Run Configurations: same for both cases
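For reference, a minimal sketch of the default Spark 1.6 context both cases assume; the app name is a placeholder and everything else is left at its defaults:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Default configuration; only the app name (a placeholder) is set explicitly.
val conf = new SparkConf().setAppName("parquet-cache-test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)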
Case 1:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)   // read the partitioned Parquet data
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count   // materialize the cache
From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.
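The cached size can also be read off programmatically rather than from the UI; a rough sketch (getRDDStorageInfo is a developer API and simply mirrors the Storage tab):

// Print the in-memory size of every cached RDD, mirroring the Storage tab.
sc.getRDDStorageInfo
  .filter(_.numCachedPartitions > 0)
  .foreach(info => println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory"))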
Case 2:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)   // same Parquet data as Case 1
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")   // only difference: order by pogId
dfMain.cache.count   // materialize the cache
From the Spark UI, the input data read is again 6.2 GB, but the cached object is only 5.5 GB.
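The only intended difference between the two cases is the order by, which should appear as a Sort in the physical plan; a rough sketch of how to confirm that, and that the in-memory columnar compression flag is still at its documented default (both calls are standard Spark 1.6 DataFrame/SQLContext APIs):

// Show parsed/analyzed/optimized/physical plans; Case 2's physical plan should contain a Sort.
dfMain.explain(true)

// Check the in-memory columnar compression flag (documented default is "true").
println(sqlContext.getConf("spark.sql.inMemoryColumnarStorage.compressed", "true"))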
Is there any explanation, or a pointer to the relevant code, for this behavior?