Value join is not a member of org.apache.spark.rdd

2020-03-31 03:50发布

问题:

This function seems valid for my IDE:

def zip[T, U](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {
    rdd1
      .zipWithIndex
      .map(_.swap)
      .join(
        rdd2
          .zipWithIndex
          .map(_.swap))
      .values
}

But when I compile, I get :

value join is not a member of org.apache.spark.rdd.RDD[(Long, T)] possible cause: maybe a semicolon is missing before `value join'? .join(

I am in Spark 1.6, I have already tried to import org.apache.spark.rdd.RDD._ and the code inside the function works well when it is directly used on two RDDs outside of a function definition.

Any idea ?

回答1:

If you change the signature:

def zip[T, U](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {

into:

def zip[T : ClassTag, U: ClassTag](rdd1:RDD[T], rdd2:RDD[U]) : RDD[(T,U)] = {

This will compile.

Why? join is a method of PairRDDFunctions (your RDD is implicitly converted into that class), which has the following signature:

class PairRDDFunctions[K, V](self: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)

This means its constructor expects implicit values of types ClassTag[T] and ClassTag[U], as these will be used as the value types (the V in the PairRDDFunctions definition). Your method has no knowledge of what T and U are, and therefore cannot provide matching implicit values. This means the implicit conversion into PairRDDFunctions "fails" (compiler doesn't perform the conversion) and therefore the method join can't be found.

Adding [K : ClassTag] is shorthand for adding an implicit argument implicit kt: ClassTag[K] to the method, which is then used by compiler and passed to the constructor of PairRDDFunctions.

For more about ClassTags and what they're good for see this good article.