Spark cross join: two similar pieces of code, one works, one doesn't

Posted 2020-05-01 09:13

Question:

I have the following code:

import org.apache.spark.sql.functions.lit
import spark.implicits._

val ori0 = Seq(
  (0L, "1")
).toDF("id", "col1")
val date0 = Seq(
  (0L, "1")
).toDF("id", "date")

val joinExpression = $"col1" === $"date"
ori0.join(date0, joinExpression).show()  // works

val ori = spark.range(1).withColumn("col1", lit("1"))
val date = spark.range(1).withColumn("date", lit("1"))
ori.join(date, joinExpression).show()    // throws AnalysisException

The first join works, but the second fails with an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Range (0, 1, step=1, splits=Some(4))
and
Project [_1#11L AS id#14L, _2#12 AS date#15]
+- Filter (isnotnull(_2#12) && (1 = _2#12))
   +- LocalRelation [_1#11L, _2#12]
Join condition is missing or trivial.

I have looked at this many times, but I still do not understand why it is a cross join, and what the difference between the two cases is.

Answer 1:

If you were to expand the second join you'd see that it is really equivalent to:

SELECT * 
FROM ori JOIN date
WHERE 1 = 1

Clearly the WHERE 1 = 1 join condition is trivial, which is one of the conditions under which Spark detects a Cartesian product.
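
If the Cartesian product is actually what you want, you can say so explicitly. A minimal sketch, assuming Spark 2.1+ where Dataset.crossJoin and the spark.sql.crossJoin.enabled setting are available (the setting defaults to true in Spark 3.x):

// Option 1: make the intent explicit with crossJoin
ori.crossJoin(date).show()

// Option 2: allow implicit Cartesian products for the whole session
spark.conf.set("spark.sql.crossJoin.enabled", "true")
ori.join(date, joinExpression).show()  // now succeeds

Option 1 is usually preferable, since it documents in the code that the blow-up in row count is intentional rather than an accident of constant folding.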

In the first case the condition is not trivial, because the optimizer cannot infer at this point that the join columns contain only a single value, so it will attempt to apply a hash or sort-merge join.
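
You can see this difference by inspecting the plans yourself. A quick sketch (the exact plan text varies by Spark version, and the specific join strategy chosen is indicative, not guaranteed):

// The first join keeps a genuine equality condition, so Spark plans a
// real join (e.g. SortMergeJoin, or BroadcastHashJoin for inputs below
// the broadcast threshold):
ori0.join(date0, joinExpression).explain()

// In the second join, constant folding reduces the condition on the two
// literal "1" columns to something trivially true, so even explain()
// triggers analysis and fails with the same AnalysisException unless
// cross joins are allowed:
// ori.join(date, joinExpression).explain()  // throws AnalysisException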