In my line of work I have two large DataFrames to be merged based on an association key, and using join
takes a long time to complete.
I have read that cogroup
is preferable to joins in Apache Spark. Can anyone point out how to use cogroup
on DataFrames, or suggest a better approach for merging two large DataFrames?
Thank you
DataFrame
doesn't provide any equivalent of the cogroup
function, and complex objects are not first-class citizens in Spark SQL. The set of operations available on complex structures is rather limited, so typically you have to either create a custom expression, which is not trivial, or use UDFs and pay a performance penalty. Moreover, Spark SQL doesn't use the same join
logic as plain RDDs
.
Regarding RDDs: while there are border cases where cogroup
can be favorable over join
, typically it shouldn't be, unless the result of the join approaches a Cartesian product of the complete dataset. After all, joins on RDDs are expressed as cogroup
followed by flatMapValues
, and since the latter operation is local, the only real overhead is the creation of the output tuples.
If your tables contain only primitive types, you could mimic cogroup-like behavior by aggregating the columns with collect_list
first, but I wouldn't expect any performance gains here.