Spark Predicate Push Down, Filtering and Partition

Posted 2020-07-26 11:51

Question:

I have been reading about Spark predicate pushdown and partition pruning to understand how much data actually gets read, and I have the following questions.

Suppose I have a dataset with columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String). The data is stored on disk in Parquet format, partitioned by Year and SchoolName, in, say, Azure Data Lake Storage.

1) If I issue a read like spark.read.parquet(container).filter("Year = 2019 AND SchoolName = 'XYZ'"):

  • Will partition pruning take effect, so that only a limited number of partitions are read?
  • Will there be I/O on the blob store, with the data loaded onto the Spark cluster and only then filtered? That is, will I have to pay Azure for the I/O of all the other data that we don't need?
  • If not, how does the Azure blob file system understand these filters, since it is not queryable by default?

2) If I issue a read like spark.read.parquet(container).filter("StudentId = 43"):

  • Will Spark still push the filter down to disk and read only the data that is required? Since I didn't partition by this column, will it have to inspect every row and filter according to the query?
  • Again, will I have to pay Azure for the I/O of all the files that the query did not need?

Answer 1:

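1) When you filter on the columns you partitioned by, Spark skips the non-matching partition directories completely, so the data files in them cost you no read I/O. The storage layer does not need to understand the filter: the partition values are encoded in the directory names, so Spark only has to compare paths. Your file structure is stored as something like: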

parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
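
To make the path-based pruning above concrete, here is a minimal sketch in plain Python. It is not Spark's implementation; it only mimics the idea of reading partition values out of directory names and discarding paths that do not match. The helper names (parse_partition_values, prune) and the Year=2018 and SchoolName=ABC paths are made up for illustration.

```python
def parse_partition_values(path):
    """Collect key=value pairs from the directory segments of a path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths, wanted):
    """Keep only the paths whose partition values match every wanted pair."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical listing of the table; only the Year=2019/SchoolName=XYZ
# layout comes from the answer, the other directories are assumed.
paths = [
    "parquet-folder/Year=2018/SchoolName=ABC/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=ABC/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet",
]

# The filter is decided from the path alone; no file content is opened.
selected = prune(paths, {"Year": "2019", "SchoolName": "XYZ"})
for p in selected:
    print(p)
```

Only the two files under Year=2019/SchoolName=XYZ survive, and that decision needed nothing more than the directory listing.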

2) When you filter on a column that isn't one of your partition columns, Spark has to visit every part file in every folder of that Parquet table. With Parquet filter pushdown (controlled by spark.sql.parquet.filterPushdown, which is enabled by default), Spark uses the footer of each part file, where min, max and count statistics are stored per row group, to determine whether your search value can fall within that range. If it can, Spark reads that data and filters it row by row. If it cannot, Spark skips it, so you pay for reading the footer of every file but not for a full read of files that cannot match.
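
The footer check described above can be sketched in plain Python as follows. This is only an illustration of min/max skipping, not the actual Parquet reader: the RowGroupStats class, the may_contain helper and all the statistics values are hypothetical, and the sketch assumes the statistics are kept per row group of each part file.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical summary of one row group's StudentId statistics."""
    file: str
    row_group: int
    min_student_id: int
    max_student_id: int


def may_contain(stats, value):
    """True if the value lies inside the [min, max] range of the row group."""
    return stats.min_student_id <= value <= stats.max_student_id


# Assumed footer contents; every footer is consulted, because StudentId
# is not a partition column and no directory can be ruled out up front.
footers = [
    RowGroupStats("part1.parquet", 0, 1, 40),
    RowGroupStats("part1.parquet", 1, 41, 90),
    RowGroupStats("part2.parquet", 0, 100, 180),
]

# Only row groups whose range covers 43 need to be read and filtered.
to_read = [(s.file, s.row_group) for s in footers if may_contain(s, 43)]
print(to_read)
```

A range that covers the value is no guarantee that the value is present, so the selected data still has to be filtered row by row. The statistics can only rule data out, which is why they help most when the column is sorted or clustered in the files.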