Why using a UDF in a SQL query leads to a Cartesian product

Published 2019-01-12 12:53

Question:

I saw the Databricks-Question and don't understand:

  1. Why does using UDFs lead to a Cartesian product instead of a full outer join? Obviously the Cartesian product would contain far more rows than a full outer join (Joins is an example), which is a potential performance hit.
  2. Is there any way to force an outer join over the Cartesian product in the example given in the Databricks-Question?

Quoting the Databricks-Question here:

I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:

Statement 1:

SELECT col1, col2 FROM table1 AS t1 JOIN table2 AS t2 ON t1.foo = t2.bar

Statement 2:

SELECT col1, col2 FROM table1 AS t1 JOIN table2 AS t2 ON equals(t1.foo, t2.bar)

I register a custom UDF using SQLContext as follows:

// Wraps plain string equality so it can be called as equals(...) from SQL
sqlc.udf.register("equals", (s1: String, s2: String) => s1 == s2)

On the same input and Spark configuration, Statement 2 performs significantly worse (close to 100x) than Statement 1.

Answer 1:

Why does using UDFs lead to a Cartesian product instead of a full outer join?

The reason why using UDFs requires a Cartesian product is quite simple. Since you pass an arbitrary function, with a possibly infinite domain and non-deterministic behavior, the only way to determine its value is to pass arguments and evaluate it. That means the optimizer cannot extract join keys from the condition, so it simply has to check all possible pairs of rows.
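
You can see this in the physical plan. A minimal sketch, assuming a SparkSession named spark and DataFrames table1 and table2 for the tables in the question (the eqUdf name is mine):

import org.apache.spark.sql.functions.udf
import spark.implicits._

// The UDF is opaque to the optimizer, which cannot see the equality inside it.
val eqUdf = udf((s1: String, s2: String) => s1 == s2)

table1.as("t1")
  .join(table2.as("t2"), eqUdf($"t1.foo", $"t2.bar"))
  .explain()
// Expect BroadcastNestedLoopJoin or CartesianProduct in the plan
// (some versions require spark.sql.crossJoin.enabled=true for the latter).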

Simple equality, on the other hand, has predictable behavior. If you use the condition t1.foo = t2.bar, you can simply shuffle the rows of t1 and t2 by foo and bar respectively to get the expected result.
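
Under the same assumptions as the sketch above, the equality version lets the optimizer extract the join keys and plan a shuffle-based equi-join:

import spark.implicits._

table1.as("t1")
  .join(table2.as("t2"), $"t1.foo" === $"t2.bar")
  .explain()
// Expect SortMergeJoin (or a hash join) over Exchange hashpartitioning
// on foo and bar, instead of a nested-loop or Cartesian plan.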

And just to be precise: in relational algebra, an outer join is actually expressed using a natural join. Anything beyond that is simply an optimization.
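
To spell that out, the textbook construction (standard relational-algebra notation; \pi projects onto one side's attributes and \omega denotes a null-padded tuple) builds the full outer join from the natural join plus the unmatched rows of each side:

$$R \text{ FULL OUTER JOIN } S = (R \bowtie S) \cup \bigl((R \setminus \pi_R(R \bowtie S)) \times \{\omega_S\}\bigr) \cup \bigl(\{\omega_R\} \times (S \setminus \pi_S(R \bowtie S))\bigr)$$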

Any way to force an outer join over the Cartesian product?

Not really, unless you want to modify the Spark SQL engine.
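
That said, when the UDF really computes a derivable key, you can usually sidestep the problem instead: materialize the UDF result as a column on each side, then join on that column with plain equality, which also supports outer joins. A sketch, assuming the same spark session and tables, with a hypothetical normalize function standing in for the UDF logic:

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical per-row key derivation the original UDF was doing implicitly.
val normalize = udf((s: String) => s.trim.toLowerCase)

// Precompute the key on each side, then use a plain equi-join, which the
// planner can execute as a shuffle join rather than a Cartesian product.
val left  = table1.withColumn("k", normalize($"foo"))
val right = table2.withColumn("k", normalize($"bar"))
left.join(right, Seq("k"), "full_outer")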