This function looks valid in my IDE:
def zip[T, U](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {
  rdd1
    .zipWithIndex
    .map(_.swap)
    .join(
      rdd2
        .zipWithIndex
        .map(_.swap))
    .values
}
But when I compile, I get:

value join is not a member of org.apache.spark.rdd.RDD[(Long, T)]
possible cause: maybe a semicolon is missing before `value join'?
    .join(
I am on Spark 1.6. I have already tried importing org.apache.spark.rdd.RDD._, and the code inside the function works fine when it is used directly on two RDDs outside of a function definition.
Any idea?
If you change the signature:
def zip[T, U](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {
into:
def zip[T : ClassTag, U: ClassTag](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {
This will compile.
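For reference, here is a minimal sketch of the fixed function in full, with the imports it needs (the only change to your code is the two context bounds):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def zip[T : ClassTag, U : ClassTag](rdd1: RDD[T], rdd2: RDD[U]): RDD[(T, U)] = {
  rdd1
    .zipWithIndex          // RDD[(T, Long)]
    .map(_.swap)           // RDD[(Long, T)]
    .join(                 // now compiles: ClassTag[Long] and ClassTag[T] are in scope
      rdd2
        .zipWithIndex
        .map(_.swap))      // join result: RDD[(Long, (T, U))]
    .values                // RDD[(T, U)]
}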
Why? join is a method of PairRDDFunctions (your RDD is implicitly converted into that class), which has the following signature:

class PairRDDFunctions[K, V](self: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)

This means its constructor expects implicit values of types ClassTag[T] and ClassTag[U], as these will be used as the value types (the V in the PairRDDFunctions definition). Your method has no knowledge of what T and U are, and therefore cannot provide the matching implicit values. This means the implicit conversion into PairRDDFunctions "fails" (the compiler does not perform the conversion), and therefore the method join can't be found.
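To make that concrete, here is roughly what the compiler has to build for left.join(right) inside zip, sketched against the constructor quoted above (the names left and right are only for illustration):

// Inside zip[T, U] without the ClassTag context bounds:
val left:  RDD[(Long, T)] = rdd1.zipWithIndex.map(_.swap)
val right: RDD[(Long, U)] = rdd2.zipWithIndex.map(_.swap)

// left.join(right) only type-checks if the compiler can wrap left as:
//   new PairRDDFunctions[Long, T](left)(implicitly[ClassTag[Long]], implicitly[ClassTag[T]])
// ClassTag[Long] is always available, but there is no ClassTag[T] in scope here,
// so the conversion is never applied and join is reported as "not a member".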
Adding [K : ClassTag] is shorthand for adding an implicit argument implicit kt: ClassTag[K] to the method, which is then used by the compiler and passed on to the constructor of PairRDDFunctions.
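In other words, the context-bound version of the signature is just a shorter way of writing this desugared form (the parameter names tt and ut are arbitrary):

def zip[T, U](rdd1: RDD[T], rdd2: RDD[U])
             (implicit tt: ClassTag[T], ut: ClassTag[U]): RDD[(T, U)] = {
  rdd1.zipWithIndex.map(_.swap)
    .join(rdd2.zipWithIndex.map(_.swap))
    .values
}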
For more about ClassTags and what they're good for, see this article.