Spark behavior on Hive partitioned table

Posted 2019-08-07 13:23

Question:

I use Spark 2.

Actually, I am not the one executing the queries, so I cannot include query plans. I have been asked this question by the data science team.

We have a Hive table partitioned into 2000 partitions and stored in Parquet format. When this table is used in Spark, exactly 2000 tasks are executed across the executors. But our block size is 256 MB, so we were expecting roughly (total size / 256 MB) partitions, which would certainly be much less than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.

UPDATE: It is the other way around. Our table is actually very large, about 3 TB, with 2000 partitions. 3 TB / 256 MB would come to roughly 11720, yet we get exactly as many Spark partitions as the table has physical partitions. I just want to understand how the tasks are generated based on data volume.
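To make the expectation explicit, the arithmetic behind that figure is just the following quick sketch (decimal units assumed; the numbers are the rough ones quoted above):

// expected number of splits if Spark cut the data purely by block size
val totalSizeMb = 3L * 1000 * 1000   // ~3 TB expressed in MB
val blockSizeMb = 256L               // block size
val expectedSplits = totalSizeMb / blockSizeMb   // ~11718, in the same ballpark as the 11720 above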

Answer 1:

In general, Hive partitions are not mapped 1:1 to Spark partitions. One Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.

The number of Spark partitions when you load a Hive table depends on these parameters (a configuration sketch follows the list):

spark.sql.files.maxPartitionBytes (default 128 MB)
spark.sql.files.openCostInBytes (default 4 MB)
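A minimal sketch of tuning these settings and checking the resulting partition count; the 256 MB value and the table name yourtable are placeholders, not something from your setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-count-check")
  .enableHiveSupport()
  .getOrCreate()

// Target split size for file scans; with 256 MB a 3 TB table would be cut into
// roughly 3 TB / 256 MB splits, regardless of how many Hive partitions it has.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
// Estimated cost of opening a file; small files get packed together because of it.
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)

val df = spark.table("yourtable")
println(df.rdd.getNumPartitions)   // Spark partitions = tasks of the scan stage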

You can check the partitions, e.g. using

spark.table("yourtable").rdd.partitions

This will give you an array of FilePartition objects, which contain the physical paths of your files.
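If you prefer not to poke at internal partition classes, a sketch using the built-in input_file_name and spark_partition_id functions shows which files land in which Spark partition (yourtable is again a placeholder):

import org.apache.spark.sql.functions.{input_file_name, spark_partition_id}

spark.table("yourtable")
  .select(spark_partition_id().as("partition"), input_file_name().as("file"))
  .distinct()
  .orderBy("partition")
  .show(50, truncate = false)   // one row per (Spark partition, file) pair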

That you got exactly 2000 Spark partitions from your 2000 Hive partitions looks like a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there, the number of Spark partitions resembled the number of files on the filesystem (one Spark partition per file, unless the file was very large).
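For reference, the split size Spark 2.x uses for file scans is derived roughly like this (a simplified sketch of the logic in FileSourceScanExec, not the exact source):

// Simplified sketch of how Spark 2.x picks the target split size for a file scan.
def maxSplitBytes(
    totalBytes: Long,          // sum of all file sizes in the scan
    fileCount: Long,           // number of files
    defaultParallelism: Int,   // spark.default.parallelism
    maxPartitionBytes: Long = 128L * 1024 * 1024,  // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long = 4L * 1024 * 1024       // spark.sql.files.openCostInBytes
): Long = {
  val bytesPerCore = (totalBytes + fileCount * openCostInBytes) / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}
// Files are then bin-packed into Spark partitions of roughly this size, so many small
// files can collapse into one Spark partition and one large file can be split into several.

So the Hive partition boundaries matter only indirectly, through the number and sizes of the files they contain.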



Answer 2:

I just want to understand how the tasks are generated on data volume.

Tasks are a runtime artifact and their number is exactly the number of partitions.

The number of tasks does not correlate with data volume in any way. It is a Spark developer's responsibility to have enough partitions to hold the data.
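A quick way to see this relationship (a sketch; yourtable is a placeholder): changing the partition count changes the task count of the next stage, whatever the data volume is.

val df = spark.table("yourtable")
println(df.rdd.getNumPartitions)            // e.g. 2000 -> the scan stage runs 2000 tasks

val repartitioned = df.repartition(400)
println(repartitioned.rdd.getNumPartitions) // 400 -> the stage after the shuffle runs 400 tasks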