For the following join between two DataFrames in Spark 1.6.0:
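import org.apache.spark.sql.functions.col

// Hash-partition both sides on the join key into the same number of partitions
val df0Rep = df0.repartition(32, col("a")).cache()
val df1Rep = df1.repartition(32, col("a")).cache()
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count())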
Is this join not only co-partitioned but also co-located? I know that for RDDs, if both sides use the same partitioner and are shuffled in the same operation, the join is co-located. But what about DataFrames? Thank you.
[https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c]
According to the article linked above, sort-merge join is the default join. I would like to add an important point from it:
For ideal performance of a sort-merge join, all rows having the same
value for the join key must be available in the same partition. This
requires the infamous partition exchange (shuffle) between executors.
Collocated partitions can avoid unnecessary data shuffle. Data needs
to be evenly distributed on the join keys, and the join keys need to
be unique enough that rows can be spread equally across the cluster
to achieve maximum parallelism from the available partitions.
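One way to check whether the repartitioned join from the question still triggers a shuffle is to print its physical plan. The following is only a minimal sketch: the object `JoinPlanCheck` and the method `explainRepartitionedJoin` are hypothetical names introduced here for illustration, and it assumes both DataFrames contain the join column. It relies only on `repartition`, `join`, and `explain` from the DataFrame API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark: repartitions both inputs on the
// join key, joins them, and prints the physical plan for inspection.
object JoinPlanCheck {
  def explainRepartitionedJoin(
      df0: DataFrame,
      df1: DataFrame,
      key: String,
      numPartitions: Int): DataFrame = {
    // Same key and same partition count on both sides (co-partitioning).
    val left = df0.repartition(numPartitions, col(key))
    val right = df1.repartition(numPartitions, col(key))

    val joined = left.join(right, key)

    // Prints the physical plan to stdout; look for Exchange operators.
    joined.explain()
    joined
  }
}
```

For the question's data this could be called as `JoinPlanCheck.explainRepartitionedJoin(df0, df1, "a", 32)`. If the printed plan shows no Exchange operator beyond the ones introduced by the two `repartition` calls, the existing partitioning is being reused rather than shuffled again. The plan describes partitioning only; it does not show whether matching partitions end up on the same executor.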