I have been reading about Spark predicate pushdown and partition pruning to understand how much data actually gets read. I have the following questions about this.
Suppose I have a dataset with the columns (Year: Int, SchoolName: String, StudentId: Int, SubjectEnrolled: String), where the data on disk is partitioned by Year and SchoolName and stored in Parquet format on, say, Azure Data Lake Storage.
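To make the layout concrete, here is a rough sketch of how I imagine the data being written (the path, account name, and values are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enrolments").getOrCreate()

# Placeholder ADLS path; in reality this would point at my storage account.
path = "abfss://container@account.dfs.core.windows.net/enrolments"

df = spark.createDataFrame(
    [(2019, "XYZ", 43, "Maths"), (2018, "ABC", 44, "Physics")],
    ["Year", "SchoolName", "StudentId", "SubjectEnrolled"],
)

# Partitioning by Year and SchoolName gives a directory layout like
#   .../enrolments/Year=2019/SchoolName=XYZ/part-....parquet
df.write.partitionBy("Year", "SchoolName").mode("overwrite").parquet(path)
```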
1) If I issue a read like spark.read(container).filter(Year=2019, SchoolName="XYZ") (see the sketch after this list):
- Will partition pruning take effect so that only the relevant partitions are read?
- Or will there be I/O on the blob store where all the data is loaded to the Spark cluster and then filtered, i.e. will I have to pay Azure for the I/O on all the other data that we don't need?
- If not, how does the Azure blob file system understand these filters, since it is not queryable by default?
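By spark.read(container) above I just mean reading the Parquet directory; concretely, something like this (same placeholder path as before), where I also print the physical plan to look for PartitionFilters:

```python
df = spark.read.parquet(path)

pruned = df.filter((df.Year == 2019) & (df.SchoolName == "XYZ"))

# If partition pruning works the way I think it does, the FileScan node in the
# plan should list Year and SchoolName under PartitionFilters.
pruned.explain()
```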
2) If I issue a read like spark.read(container).filter(StudentId = 43) (see the sketch after this list):
- Will Spark still push the filter down to disk and only read the data that is required? Since I didn't partition by this column, will it have to go through every row and filter according to the query?
- Again, will I have to pay Azure for the I/O on all the files that were not required by the query?
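And for the second case, what I mean is something like this (again with the placeholder path):

```python
df = spark.read.parquet(path)

by_student = df.filter(df.StudentId == 43)

# StudentId is not a partition column, so I expect the predicate to appear
# under PushedFilters in the plan; my understanding is that Parquet row-group
# statistics can then skip some data, but every file may still be opened.
by_student.explain()
```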