Efficiently using union in Spark

Posted 2019-09-14 11:27

Question:

I am new to Scala and Spark. I have two RDDs, A = [(1,2),(2,3)] and B = [(4,5),(5,6)], and I want an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large: both A and B are about 10 GB. I use sc.union(A, B), but it is slow; in the Spark UI I see 28308 tasks in this stage.

Is there a more efficient way to do this?
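Roughly what I am doing now (a minimal, runnable sketch; the local SparkContext and the tiny sample data are placeholders for my real 10 GB inputs):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("unionTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-ins for the real 10 GB datasets.
    val A = sc.parallelize(Seq((1, 2), (2, 3)))
    val B = sc.parallelize(Seq((4, 5), (5, 6)))

    // This is the step that is slow on the real data (28308 tasks in the Spark UI).
    val unioned = sc.union(A, B)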

Answer 1:

Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and call .toDF() with the column names. For example:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()

    val sqlContext = sparkSession.sqlContext

    // Needed for the .toDF() conversion on local Seqs and RDDs.
    import sqlContext.implicits._

    val firstTableColumns = Seq("col1", "col2")
    val secondTableColumns = Seq("col3", "col4")

    val firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)

    val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

    // union matches columns by position (not by name) and keeps duplicates.
    val unionDF = firstDF.union(secondDF)

It should be easier for you to work with DataFrames than with RDDs. Converting a DataFrame back to an RDD is easy too: just call the .rdd method:

    val rddData = unionDF.rdd
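For completeness, here is the full round trip applied to the question's data (a sketch that reuses sparkSession and the sqlContext.implicits._ import from the block above; rddA and rddB are stand-ins for the original 10 GB RDDs, and the column names are illustrative):

    // rddA and rddB stand in for the question's two RDD[(Int, Int)] inputs.
    val rddA = sparkSession.sparkContext.parallelize(Seq((1, 2), (2, 3)))
    val rddB = sparkSession.sparkContext.parallelize(Seq((4, 5), (5, 6)))

    // RDD -> DataFrame, union (positional), then back to an RDD of pairs.
    val unionRDD = rddA.toDF("col1", "col2")
      .union(rddB.toDF("col1", "col2"))
      .rdd
      .map(r => (r.getInt(0), r.getInt(1)))
    // unionRDD contains (1,2), (2,3), (4,5), (5,6)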