How to write a nested query?

2019-07-18 17:53发布

问题:

I have following table:

+-----+---+----+
|type | t |code|
+-----+---+----+
|    A| 25|  11|
|    A| 55|  42|
|    B| 88|  11|
|    A|114|  11|
|    B|220|  58|
|    B|520|  11|
+-----+---+----+

And what I want:

+-----+---+----+
|t1   | t2|code|
+-----+---+----+
|   25| 88|  11|
|  114|520|  11|
+-----+---+----+

There are two types of events A and B. Event A is the start, Event B is the end. I want to connect the start with the next end dependence of the code.

It's quite easy in SQL to do this:

SELECT a.t AS t1,  
   (SELECT b.t FROM events AS b WHERE a.code == b.code AND a.t < b.t LIMIT 1) AS t2, a.code AS code
FROM events AS a

But I have to problem to implement this in Spark because it looks like that this kind of nested query isn't supported...

I tried it with:

df.createOrReplaceTempView("events")
val sqlDF = spark.sql(/* SQL-query above */)

Error i get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in:

Do you have any other ideas to solve that problem?

回答1:

It's quite easy in SQL to do this

And so is in Spark SQL, luckily.

val events = ...
scala> events.show
+----+---+----+
|type|  t|code|
+----+---+----+
|   A| 25|  11|
|   A| 55|  42|
|   B| 88|  11|
|   A|114|  11|
|   B|220|  58|
|   B|520|  11|
+----+---+----+

// assumed that t is int
scala> events.printSchema
root
 |-- type: string (nullable = true)
 |-- t: integer (nullable = true)
 |-- code: integer (nullable = true)

val eventsA = events.
  where($"type" === "A").
  as("a")
val eventsB = events.
  where($"type" === "B").
  as("b")
val solution = eventsA.
  join(eventsB, "code").
  where($"a.t" < $"b.t").
  select($"a.t" as "t1", $"b.t" as "t2", $"a.code").
  orderBy($"t1".asc, $"t2".asc).
  dropDuplicates("t1", "code").
  orderBy($"t1".asc)

That should give you the requested output.

scala> solution.show
+---+---+----+
| t1| t2|code|
+---+---+----+
| 25| 88|  11|
|114|520|  11|
+---+---+----+