I have two RDDs. One RDD is of type RDD[(String, String, String)] and the second RDD is of type RDD[(String, String, String, String, String)]. Whenever I try to perform operations like union, intersection, etc., I get this error:
error: type mismatch;
found: org.apache.spark.rdd.RDD[(String, String, String, String,String, String)]
required: org.apache.spark.rdd.RDD[(String, String, String)]
uid.union(uid1).first()
How can I perform the set operations in this case? If set operations are not possible at all, what can I do to get the same result as set operations without having the type mismatch problem?
EDIT:
Here's a sample of the first line from each of the two RDDs:
(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")
(fb_100007609418328,-795000,r316079113_serv60i)
Several operations require the two RDDs to have the same type.
Let's take union as an example: union basically concatenates two RDDs. As you can imagine, it would be unsound to concatenate the following:
RDD1
(1, 2)
(3, 4)
RDD2
(5, 6, "string1")
(7, 8, "string2")
As you can see, RDD2 has one extra column. One thing you can do is transform RDD1 so that its schema matches that of RDD2, for example by adding a default value:
RDD1
(1, 2)
(3, 4)
RDD1 (AMENDED)
(1, 2, "default")
(3, 4, "default")
RDD2
(5, 6, "string1")
(7, 8, "string2")
UNION
(1, 2, "default")
(3, 4, "default")
(5, 6, "string1")
(7, 8, "string2")
You can achieve this with the following code:
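import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // your SparkContext
val rdd1: RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (3, 4)))
val rdd2: RDD[(Int, Int, String)] =
  sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))
// Add a default third column so that rdd1 has the same element type as rdd2
val amended: RDD[(Int, Int, String)] =
  rdd1.map(pair => (pair._1, pair._2, "default"))
// Both sides now have the same type, so union compiles
val union: RDD[(Int, Int, String)] =
  amended.union(rdd2)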
If you now print the contents
union.foreach(println)
you will get the rows shown under UNION in the example above (not necessarily in that order, since the rows are printed partition by partition).
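A sentinel string such as "default" can collide with real data. As a variation on the code above, the missing column can be modelled as an Option instead. The following is only a sketch under that assumption: the names `OptionPadding`, `Row`, `pad` and `lift` are made up for illustration, and the local master setting is just there so the example can run on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OptionPadding {
  // Common element type: the third column may be absent.
  type Row = (Int, Int, Option[String])

  // Pad a two-field row with None for the missing column.
  def pad(pair: (Int, Int)): Row = (pair._1, pair._2, None)

  // Wrap the existing third column in Some so both sides share one type.
  def lift(row: (Int, Int, String)): Row = (row._1, row._2, Some(row._3))

  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only; use your own in practice.
    val conf = new SparkConf().setMaster("local[*]").setAppName("option-padding")
    val sc = new SparkContext(conf)
    try {
      val rdd1: RDD[(Int, Int)] =
        sc.parallelize(Seq((1, 2), (3, 4)))
      val rdd2: RDD[(Int, Int, String)] =
        sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

      // Both sides are mapped to Row, so union type-checks.
      val union: RDD[Row] = rdd1.map(pad).union(rdd2.map(lift))

      // collect() brings the rows to the driver before printing;
      // only do this for RDDs small enough to fit in driver memory.
      union.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Downstream code can then pattern match on `Some` and `None` to tell real values from padded ones, instead of comparing against a magic string.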
Of course, the exact semantics of how you want the two RDDs to match depend on your problem.
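For the RDDs in the question, the opposite direction may make more sense: instead of padding the narrow records, project the six-field records (as shown in the error message and the sample) down to three fields, and then apply union or intersection. The sketch below assumes that the first three fields of the wide record are the ones that correspond to the narrow record. That is a guess made purely for illustration, as are the names `AlignShapes`, `Narrow`, `Wide` and `toCommon` and the sample values; adjust the projection to whatever your data actually means.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AlignShapes {
  type Narrow = (String, String, String)
  type Wide = (String, String, String, String, String, String)

  // Hypothetical projection: keep the first three fields of the wide record.
  // Which fields really correspond is an assumption; change it as needed.
  def toCommon(w: Wide): Narrow = (w._1, w._2, w._3)

  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only; use your own in practice.
    val conf = new SparkConf().setMaster("local[*]").setAppName("align-shapes")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data, not taken from the question.
      val narrow: RDD[Narrow] =
        sc.parallelize(Seq(("a", "1", "x"), ("b", "2", "y")))
      val wide: RDD[Wide] =
        sc.parallelize(Seq(
          ("a", "1", "x", "p", "q", "r"),
          ("c", "3", "z", "p", "q", "r")))

      // After the projection both RDDs have element type Narrow.
      val projected: RDD[Narrow] = wide.map(toCommon)

      // Set-like operations now type-check.
      val common: RDD[Narrow] = narrow.intersection(projected)
      val all: RDD[Narrow] = narrow.union(projected).distinct()

      common.collect().foreach(println)
      all.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that `union` keeps duplicates, which is why `distinct()` is applied here to get set semantics, whereas `intersection` already returns each element only once.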