I am new to Scala and Spark. I have two RDDs: A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10 GB. I use sc.union(A, B), but it is slow, and the Spark UI shows 28308 tasks in this stage.
Is there a more efficient way to do this?
Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy: just import sqlContext.implicits._ and call the .toDF() function with the column names.

For example:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._

val firstTableColumns = Seq("col1", "col2")
val secondTableColumns = Seq("col3", "col4")

// toDF() turns a local Seq (or an RDD) of tuples into a DataFrame
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

firstDF = firstDF.union(secondDF)
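Note that DataFrame union matches columns by position, not by name (so the result keeps firstDF's column names), and it does not remove duplicate rows; call .distinct() afterwards if you want set semantics.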
It should be easier for you to work with DataFrames than with RDDs, and converting a DataFrame back to an RDD is easy too; just call the .rdd method:
val rddData = firstDF.rdd
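Put together, here is a minimal sketch applying this to the two RDDs in your question (assuming the sparkSession and implicits import from above; A, B, unionDF, and unionRDD are just illustrative names standing in for your actual data):

// pair RDDs standing in for your 10 GB inputs
val A = sparkSession.sparkContext.parallelize(Seq((1, 2), (2, 3)))
val B = sparkSession.sparkContext.parallelize(Seq((4, 5), (5, 6)))

// RDD -> DataFrame, union, then back to an RDD[(Int, Int)]
val unionDF = A.toDF("col1", "col2").union(B.toDF("col1", "col2"))
val unionRDD = unionDF.rdd.map(row => (row.getInt(0), row.getInt(1)))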