How to perform Set transformations on RDD's wi

Posted 2019-09-16 06:15

Question:

I have two RDDs. One is of type RDD[(String, String, String)] and the other is of type RDD[(String, String, String, String, String, String)]. Whenever I try to perform set operations such as union or intersection, I get this error:

error: type mismatch;
found: org.apache.spark.rdd.RDD[(String, String, String, String,String, String)]
required: org.apache.spark.rdd.RDD[(String, String, String)]
   uid.union(uid1).first()

How can I perform the set operations in this case? If set operations are not possible at all, what can I do to get the same result as set operations without having the type mismatch problem?

EDIT:

Here's the first line from each of the two RDDs:

(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502") 

(fb_100007609418328,-795000,r316079113_serv60i) 

Answer 1:

Several operations require two RDDs to have the same type.

Take union as an example: it concatenates two RDDs, so both must have the same element type. As you can imagine, it would be unsound to concatenate the following:

RDD1
(1, 2)
(3, 4)

RDD2
(5, 6, "string1")
(7, 8, "string2")

As you can see, RDD2 has one extra column. One thing you can do is transform RDD1 so that its schema matches that of RDD2, for example by adding a default value:

RDD1
(1, 2)
(3, 4)

RDD1 (AMENDED)
(1, 2, "default")
(3, 4, "default")

RDD2
(5, 6, "string1")
(7, 8, "string2")

UNION
(1, 2, "default")
(3, 4, "default")
(5, 6, "string1")
(7, 8, "string2")

You can achieve this with the following code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // your SparkContext

val rdd1: RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (3, 4)))

val rdd2: RDD[(Int, Int, String)] =
  sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

val amended: RDD[(Int, Int, String)] =
  rdd1.map(pair => (pair._1, pair._2, "default"))

val union: RDD[(Int, Int, String)] =
  amended.union(rdd2)

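If you now print the contents

union.collect().foreach(println)

you will get the four rows shown under UNION in the example above. The collect() call brings the rows to the driver first; a plain foreach(println) prints on the executors when running on a cluster.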
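If a sentinel such as "default" could be confused with real data, one variation is to make the extra column an Option on both sides. This is only a sketch of that idea: the object name OptionPadding and the helpers widen and wrap are hypothetical, and the local master is an assumption made so the example runs on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OptionPadding {
  // Widen a pair to the three-column shape, marking the missing column as None.
  def widen(pair: (Int, Int)): (Int, Int, Option[String]) =
    (pair._1, pair._2, None)

  // Wrap the existing third column in Some so both RDDs share one element type.
  def wrap(row: (Int, Int, String)): (Int, Int, Option[String]) =
    (row._1, row._2, Some(row._3))

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for demonstration only.
    val conf = new SparkConf().setAppName("option-padding").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1: RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (3, 4)))
      val rdd2: RDD[(Int, Int, String)] =
        sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

      // Both sides now have the element type (Int, Int, Option[String]).
      val combined: RDD[(Int, Int, Option[String])] =
        rdd1.map(widen).union(rdd2.map(wrap))

      // Bring the rows to the driver before printing them.
      combined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Rows that came from the narrower RDD can then be told apart by matching on None, without reserving a magic string.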

Of course, the exact semantics of how you want the two RDDs to match depend on your problem.
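
For the data in the question, the two RDDs differ by more than one trailing column, so padding may not be meaningful. One possible alternative is to reduce both RDDs to a key they share and run the set operations on those keys. The sketch below assumes, purely for illustration, that the first field of the six-field rows and the third field of the three-field rows hold the same kind of identifier; the sample lines hint at this but do not confirm it. The names SharedKeySets, wideKey and narrowKey are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SharedKeySets {
  type Wide = (String, String, String, String, String, String)
  type Narrow = (String, String, String)

  // Assumption: field 1 of the wide rows is the shared id.
  // The sample row is whitespace-padded, hence the trim.
  def wideKey(row: Wide): String = row._1.trim

  // Assumption: field 3 of the narrow rows is the same kind of id.
  def narrowKey(row: Narrow): String = row._3.trim

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for demonstration only.
    val conf = new SparkConf().setAppName("shared-key-sets").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The sample rows from the question.
      val wide: RDD[Wide] = sc.parallelize(Seq(
        (" p69465323_serv80i", " 7 ", " fb_406423006398063",
          " guest_861067032060185_android", " fb_100000829486587",
          " fb_100007900293502")))
      val narrow: RDD[Narrow] = sc.parallelize(Seq(
        ("fb_100007609418328", "-795000", "r316079113_serv60i")))

      // Both key RDDs are RDD[String], so the set operations type-check.
      val wideKeys: RDD[String] = wide.map(wideKey)
      val narrowKeys: RDD[String] = narrow.map(narrowKey)

      val allKeys = wideKeys.union(narrowKeys).distinct()
      val sharedKeys = wideKeys.intersection(narrowKeys)
      val onlyInWide = wideKeys.subtract(narrowKeys)

      // Full rows from both sides for every key present in both RDDs.
      val sharedRows: RDD[(String, (Wide, Narrow))] =
        wide.keyBy(wideKey).join(narrow.keyBy(narrowKey))

      println(s"all=${allKeys.count()} shared=${sharedKeys.count()} " +
        s"onlyInWide=${onlyInWide.count()} joined=${sharedRows.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Working on keys keeps the set operations well-typed without inventing values for columns that one side does not have; the join at the end recovers the original rows when more than the key is needed.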