case class Foo1(codes: Seq[String], name: String)
case class Foo2(code: String, description: String)

val ds1 = Seq(
  Foo1(Seq("A"), "foo1"),
  Foo1(Seq("A", "B"), "foo2"),
  Foo1(Seq("B", "C", "D"), "foo3"),
  Foo1(Seq("C"), "foo4"),
  Foo1(Seq("C", "D"), "foo5")
).toDS

val ds2 = Seq(
  Foo2("A", "product A"),
  Foo2("B", "product B"),
  Foo2("C", "product C"),
  Foo2("D", "product D"),
  Foo2("E", "product E")
).toDS

val j = ds1.join(ds2, ds2("code") isin (ds1("codes")))
Hopefully this Scala code fragment makes clear what I'm trying to accomplish. Our data is structured so that one dataset has a column containing an array of values, and I want to join the values within that collection against another dataset. So, for example, the row with Seq("A", "B") in ds1 would join with both the "A" row and the "B" row in ds2, producing one output row per match.
The isin operator on Column seems to be exactly what I want, and this compiles, but when I run it I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve '(code IN (codes))' due to data type mismatch: Arguments must be same type;;
Reading further, I see that isin() takes varargs ("splatted args") and seems more suitable for use with filter(). So my question is: is this the intended use of this operator, or is there some other way to perform this type of join?
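For what it's worth, here's a sketch of what I've pieced together so far. The filter usage is the one the isin() docs describe; the array_contains join is an assumption on my part that I haven't confirmed is the idiomatic approach:

```scala
import org.apache.spark.sql.functions.array_contains

// isin() compares a column against a literal list of values, so it fits
// filtering rather than joining on an array column:
val filtered = ds2.filter(ds2("code").isin("A", "B"))

// For the join itself, array_contains(arrayCol, value) as the join
// condition seems closer to what I want (untested assumption on my part):
val j2 = ds1.join(ds2, array_contains(ds1("codes"), ds2("code")))
```

If array_contains works here, I'd expect the foo2 row to appear twice in j2, once paired with "product A" and once with "product B", but I'd welcome confirmation that this is the intended way to express the join.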