I would like to partition an RDD by key so that each partition contains only the values of a single key. For example, if I have 100 distinct key values and I call repartition(102), the RDD should have 2 empty partitions and 100 partitions each containing the values of a single key.
I tried groupByKey(k).repartition(102), but this does not guarantee the exclusivity of a key in each partition: I see some partitions containing the values of more than one key, and more than 2 empty partitions.
Is there a way in the standard API to do this?
To use partitionBy(), the RDD must consist of tuple (pair) objects. Let's see an example below.
Suppose I have an input file with the following data:
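(The original post's sample data is not shown here; a small hypothetical CSV with an integer key in the first column serves for the steps below.)

```
id,name,dept
1,Alice,Sales
2,Bob,HR
3,Carol,Sales
4,Dan,IT
5,Eve,HR
6,Frank,Sales
```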
Reading the file into an RDD and skipping the header:
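A minimal PySpark sketch of this step (the file name is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# read the file into an RDD of lines; "employees.csv" is a hypothetical path
raw = sc.textFile("employees.csv")

# drop the header row
header = raw.first()
rdd = raw.filter(lambda line: line != header)
```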
Now let's repartition the RDD into 5 partitions:
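Something along these lines, assuming the rdd from the previous step:

```python
# shuffle the data into 5 partitions
rdd5 = rdd.repartition(5)
print(rdd5.getNumPartitions())  # 5
```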
Let's have a look at how the data is distributed across these 5 partitions:
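glom() collects each partition into a list, which makes the layout easy to print; a sketch:

```python
# print the contents of each of the 5 partitions
for i, part in enumerate(rdd5.glom().collect()):
    print("partition", i, ":", part)
```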
Here you can see that the data is written into only two partitions, three of them are empty, and the records are not distributed uniformly.
We need to create a pair RDD in order to have the data distributed uniformly across the partitions. Let's break each record into a key/value pair.
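A sketch of this step, assuming the integer key sits in the first CSV column:

```python
# split each line and use the int at position [0] as the key,
# keeping the full record as the value
pair_rdd = rdd.map(lambda line: (int(line.split(",")[0]), line))
```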
Now let's repartition this RDD into 5 partitions with partitionBy(), distributing the data uniformly across the partitions using the key at position [0].
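With a pair RDD in hand, partitionBy() does the distribution; a sketch:

```python
# hash-partition by key into 5 partitions
partitioned = pair_rdd.partitionBy(5)
```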
Now we can see that the data is distributed uniformly according to the matching key/value pairs.
Below you can verify the number of records in each partition.
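One way to count the records per partition, again via glom():

```python
# number of records in each of the 5 partitions
print(partitioned.glom().map(len).collect())
```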
Please note that when you create the pair RDD, your key should be of type int, or you will get an error.
Hope this helps!
For an RDD, have you tried using partitionBy to partition the RDD by key, like in this question? You can specify the number of partitions to be the number of keys to get rid of the empty partitions if desired.
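A minimal sketch of that idea in PySpark, with made-up data (100 keys, 10 values each); with small non-negative integer keys, the default hash partitioner happens to send each key to its own partition:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# hypothetical pair RDD: 100 distinct keys, 10 values per key
pairs = sc.parallelize([(k, v) for k in range(100) for v in range(10)])

# one partition per distinct key, so no partition holds two keys
# and none is left empty
exact = pairs.partitionBy(pairs.keys().distinct().count())
print(exact.glom().map(len).collect())  # 10 records in every partition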
In the Dataset API, you can use repartition with a Column as an argument to partition by the values in that column (although note that this uses the value of spark.sql.shuffle.partitions as the number of partitions, so you'll get a lot more empty partitions).
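A sketch of the same idea in PySpark's DataFrame API (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(k, v) for k in range(100) for v in range(10)],
    ["key", "value"],
)

# all rows sharing a key land in the same partition, but the partition
# count comes from spark.sql.shuffle.partitions (200 by default),
# so most partitions end up empty
by_key = df.repartition(df["key"])
print(by_key.rdd.getNumPartitions())  # 200 unless the config is changed
```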