I came across the Databricks-Question below, and there are two things I don't understand:
- Why does using a UDF in the join condition lead to a Cartesian product instead of a full outer join? A Cartesian product obviously produces far more rows than a full outer join (see Joins for an example), which is a potential performance hit.
- Is there any way to force an outer join over the Cartesian product in the example given in the Databricks-Question? (The sketch right after this list shows how I inspect which join strategy the planner actually picks.)
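
To make the question concrete, this is how I compare the physical plans of the two statements; a minimal sketch assuming the `sqlc` SQLContext, the registered `equals` UDF, and the `table1`/`table2` temp tables from the quoted question below:

// Print the physical plan Spark chose for each statement. A
// CartesianProduct operator (plus a Filter) in the second plan, versus
// an ordinary join operator in the first, would confirm the fallback.
sqlc.sql("select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar").explain()
sqlc.sql("select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo, t2.bar)").explain()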
Quoting the Databricks-Question here:
I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:
Statement 1:
Select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar
Statement 2:
Select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo, t2.bar)
I register a custom UDF using SQLContext as follows:
sqlc.udf.register("equals", (s1: String, s2: String) => s1 == s2)
With the same input and Spark configuration, Statement 2 performs significantly worse (close to 100x) than Statement 1.
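
For what it's worth, here is a self-contained sketch of the kind of rewrite I am asking about in the second bullet: keep a native equality in the ON clause so the planner can still pick an equi-join, and demote the UDF to a post-join WHERE filter. The toy tables and the myEquals UDF are my own made-up stand-ins (Spark 1.x API, to match the SQLContext in the quoted question), and this only helps when the UDF predicate can be decomposed into an equality plus extra filtering:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("udf-join-repro").setMaster("local[*]"))
val sqlc = new SQLContext(sc)
import sqlc.implicits._

// Made-up toy tables mirroring table1/table2 from the question.
sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("foo", "col1").registerTempTable("table1")
sc.parallelize(Seq(("a", 10), ("c", 20))).toDF("bar", "col2").registerTempTable("table2")

sqlc.udf.register("myEquals", (s1: String, s2: String) => s1 == s2)

// UDF in the ON clause: the predicate is opaque to the planner, so it
// cannot extract a join key and falls back to CartesianProduct + Filter.
sqlc.sql("select col1, col2 from table1 t1 join table2 t2 on myEquals(t1.foo, t2.bar)").explain()

// Rewrite: the native equality in ON lets the planner pick an equi-join;
// the UDF then runs as a plain filter on the already-joined rows.
sqlc.sql("""select col1, col2
            from table1 t1 join table2 t2 on t1.foo = t2.bar
            where myEquals(t1.foo, t2.bar)""").explain()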