I have a text file consisting of a large number of random floating-point values separated by spaces. I am loading this file into an RDD in Scala. How does this RDD get partitioned?
Also, is there any method to generate custom partitions such that all partitions have an equal number of elements, along with an index for each partition?
```scala
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val keyval = dRDD.map(x => process(x.trim().split(' ').map(_.toDouble), query_norm, m, r))
```
Here I am loading multiple text files from HDFS, and process is a function I am calling. Is there a solution using mapPartitionsWithIndex, and how can I access that index inside the process function? Map shuffles the partitions.
You can generate custom partitions using the coalesce function:
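For example, a minimal sketch reusing the RDD from the question (the target of 8 partitions is just an assumed value):

```scala
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
// coalesce(numPartitions, shuffle): with shuffle = true the elements are
// redistributed across exactly 8 partitions, so the partition sizes come
// out roughly equal (8 is an arbitrary number chosen for illustration).
val balanced = dRDD.coalesce(8, shuffle = true)
```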
By default, a partition is created for each HDFS block, which is 64 MB by default. Read more here.
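You can check this yourself; the snippet below is a small sketch assuming Spark 1.6+ for getNumPartitions (on older releases use rdd.partitions.size):

```scala
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
// Roughly one partition per HDFS block of the matched files, e.g. a single
// 200 MB file with 64 MB blocks ends up as 4 partitions.
println(dRDD.getNumPartitions)
```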
First, take a look at the three ways in which you can repartition your data:
1) Pass a second parameter, the desired minimum number of partitions for your RDD, into textFile(), but be careful:
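A hypothetical illustration of that caveat (the file size and block count are assumed purely for the example):

```scala
// Suppose the files matched by input* occupy about 1 GB on HDFS, i.e. roughly
// 16 blocks of 64 MB. Asking for a *minimum* of 2 partitions changes nothing:
val dRDD = sc.textFile("hdfs://master:54310/Data/input*", 2)
println(dRDD.getNumPartitions)  // still ~16; the second argument is only a lower bound
```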
As you can see, this does not do what one would expect, since the number of partitions the RDD already has is greater than the minimum number of partitions we requested.
2) Use repartition(), like this:
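For instance, reusing dRDD from above (100 is just an assumed target):

```scala
val repartitioned = dRDD.repartition(100)
println(repartitioned.getNumPartitions)  // exactly 100, but every element was shuffled
```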
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs: repartition() can increase or decrease the number of partitions in the RDD, and it always performs a full shuffle to redistribute the data; if you are only decreasing the number of partitions, consider coalesce() instead, which can avoid the shuffle.
3) Use coalesce(), like this:
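For example, shrinking dRDD down to an assumed target of 4 partitions:

```scala
val shrunk = dRDD.coalesce(4)
println(shrunk.getNumPartitions)  // at most 4; existing partitions are merged without a shuffle
```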
Here, Spark knows that you will shrink the RDD and takes advantage of it. Read more about repartition() vs coalesce().
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I experienced in How to balance my data across the partitions?
The loaded RDD is partitioned by the default mechanism: its partitions simply follow the input splits, and it has no custom partitioner. To specify a custom partitioner, you can use rdd.partitionBy() on a keyed RDD, provided with your own partitioner (see the sketch below).
I don't think it is OK to use coalesce() here: according to the API docs, coalesce() can only be used to reduce the number of partitions, and you cannot specify a custom partitioner with coalesce() either.
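For completeness, here is a rough sketch of that idea. It assumes you first key each element (here by its global index from zipWithIndex()) so that partitionBy() applies; the EqualCountPartitioner name and the equal-size split are illustrative, not a standard Spark class:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._   // for rdd.partitionBy() on pre-1.3 Spark

// Hypothetical partitioner: the element with global index i goes to partition
// i / elementsPerPartition, so every partition holds roughly the same count.
class EqualCountPartitioner(numParts: Int, elementsPerPartition: Long) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int =
    math.min((key.asInstanceOf[Long] / elementsPerPartition).toInt, numParts - 1)
}

val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val numParts = 8                                        // assumed target
val perPart = math.max(1L, (dRDD.count() + numParts - 1) / numParts)

// Key each line by its global index, then apply the custom partitioner.
val keyed = dRDD.zipWithIndex().map { case (line, idx) => (idx, line) }
val evenlyPartitioned = keyed.partitionBy(new EqualCountPartitioner(numParts, perPart))
```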