Cogroup on Spark DataFrames

Posted 2019-07-31 16:06

Question:

In my work I have two large DataFrames to merge on an association key. Using join takes a long time to complete the task.

I have read that cogroup is preferable to join in Apache Spark. Can anyone show how to use cogroup on DataFrames, or suggest a better approach for merging two large DataFrames?

Thank you

Answer 1:

DataFrame doesn't provide any equivalent of the cogroup function, and complex objects are not first-class citizens in Spark SQL. The set of operations available on complex structures is rather limited, so typically you have to either create a custom expression, which is not trivial, or use UDFs and pay a performance penalty. Moreover, Spark SQL doesn't use the same join logic as plain RDDs.

Regarding RDDs: while there exist border cases where cogroup can be favorable over join, typically that shouldn't be the case unless the result approaches the Cartesian product of the complete dataset. After all, joins on RDDs are expressed using cogroup followed by flatMapValues, and since the latter operation is local, the only real overhead is the creation of the output tuples.
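
To make the "cogroup followed by flatMapValues" point concrete, here is a minimal sketch in plain Python. It is only a model of the idea, not Spark's implementation: the function names cogroup and join_via_cogroup and the sample data are made up for illustration, and ordinary lists of (key, value) pairs stand in for RDDs.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key.

    Returns {key: (left_values, right_values)}. A key that appears on only
    one side gets an empty list for the other side.
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def join_via_cogroup(left, right):
    """Inner join built on top of cogroup (hypothetical helper)."""
    result = []
    for key, (left_values, right_values) in cogroup(left, right).items():
        # This step plays the role of flatMapValues: it only looks at the
        # values already collected for one key and emits one output tuple
        # per (left, right) combination.
        for pair in product(left_values, right_values):
            result.append((key, pair))
    return result


# Illustrative data only.
left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "y")]

grouped = cogroup(left, right)
joined = join_via_cogroup(left, right)
```

In this model the grouping step is where all the data has to be brought together by key; the join adds only the per-key product on top. That is why swapping join for cogroup rarely helps unless the product itself is the expensive part.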

If your tables contain only primitive types, you could mimic cogroup-like behavior by aggregating columns with collect_list first, but I wouldn't expect any performance gains here.