Just wondering if Parquet predicate pushdown also works on S3, not only HDFS, specifically when we use Spark (non-EMR). Further explanation would be helpful, since the answer might involve some understanding of distributed file systems.
Here are the keys I'd recommend for s3a work:
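(The list itself wasn't preserved in the post; the settings below are commonly recommended for Parquet work over s3a and are a reconstruction, not the original author's exact list:)

    # Parquet: keep pushdown on, avoid schema merging and summary files
    spark.sql.parquet.filterPushdown true
    spark.sql.parquet.mergeSchema false
    spark.hadoop.parquet.enable.summary-metadata false
    # Prune partitions via the Hive metastore instead of listing everything on S3
    spark.sql.hive.metastorePartitionPruning true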
Yes. Filter pushdown does not depend on the underlying file system. It only depends on the spark.sql.parquet.filterPushdown setting and on the type of filter (not all filters can be pushed down). See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
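To check this yourself, here is a minimal sketch (the bucket, path, and column names are hypothetical) that enables the setting and inspects the physical plan:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-pushdown-check")
      .config("spark.sql.parquet.filterPushdown", "true") // true by default
      .getOrCreate()

    val df = spark.read
      .parquet("s3a://my-bucket/events/")  // hypothetical S3 location
      .filter("event_date = '2021-01-01'") // simple comparison: pushable

    // The FileScan node of the plan should show a non-empty
    // PushedFilters: [IsNotNull(event_date), EqualTo(event_date,2021-01-01)]
    df.explain()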
I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.
Results:
I will add more details about the tests and results when I have time.
Recently I tried this with Spark 2.4, and it seems that predicate pushdown works with S3.
This is the Spark SQL query:
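(The query itself didn't survive in the post; a hypothetical query of the same shape, with made-up table and column names, would be:)

    val q = spark.sql("""
      SELECT id, value
      FROM my_s3_table          -- table whose files live on s3a://...
      WHERE category = 'books'  -- predicate eligible for pushdown
    """)
    q.explain()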
And here is the relevant part of the output:
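(The output was likewise not preserved; in Spark 2.4 the relevant FileScan line of the physical plan typically looks like the following, with illustrative names:)

    *(1) FileScan parquet default.my_s3_table[id#0,value#1,category#2]
        Batched: true, Format: Parquet,
        Location: InMemoryFileIndex[s3a://my-bucket/my_s3_table],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(category), EqualTo(category,books)],
        ReadSchema: struct<id:int,value:string,category:string>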
This clearly shows that PushedFilters is not empty.
Note: the table used was created on top of AWS S3.
Spark uses the same Parquet code over Hadoop's filesystem layer for both HDFS and S3, so the same logic works. (And in Spark 1.6 an even faster shortcut was added for flat-schema Parquet files.)
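As a small illustration (paths hypothetical), the read code is identical for either filesystem, so pushdown behaves the same:

    // Same Parquet reader, different filesystem URI scheme
    val fromHdfs = spark.read.parquet("hdfs:///warehouse/events")
    val fromS3   = spark.read.parquet("s3a://my-bucket/events")
    // Both plans show the same FileScan node and PushedFilters behavior.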