Concatenating datasets of different RDDs in Apache

Posted 2019-01-22 18:05

Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement: I create two intermediate RDDs in Scala that have the same column names. I need to combine the results of both RDDs and cache the combined result so the UI can access it. How do I combine the datasets here?

The RDDs are of type spark.sql.SchemaRDD.

Tags: scala apache-spark apache-spark-sql distributed-computing rdd

2 answers

聊天终结者

Reply #2 · 2019-01-22 18:19

I think you are looking for RDD.union:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in spark-shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
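Since the question also asks about caching the combined result, here is a minimal sketch of union followed by cache. The object name, the `combine` helper, the sample rows, and the local master setting are all assumptions made for illustration; in spark-shell you would use the provided `sc` instead of creating a context. It also shows that union keeps duplicate rows, so distinct is needed if duplicates should be dropped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddUnionSketch {
  type Record = (Int, String, Int)

  // Hypothetical helper: concatenate two RDDs of the same element type
  // and mark the result for caching (materialized on the first action).
  def combine(a: RDD[Record], b: RDD[Record]): RDD[Record] =
    a.union(b).cache()

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-union-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31)))
      val rdd2 = sc.parallelize(Seq((1, "Sep", 31), (2, "Oct", 5)))

      val combined = combine(rdd1, rdd2)
      println(combined.count())            // 4: the duplicate row is kept
      println(combined.distinct().count()) // 3: distinct removes it
    } finally {
      sc.stop()
    }
  }
}
```

The cached RDD stays available only while the same SparkContext is alive, so a UI that reads it would need to share that context.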

姐就是有狂的资本

Reply #3 · 2019-01-22 18:29

I had the same problem. To combine by row instead of by column, use unionAll:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it after reading the method summary for DataFrame. More information: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
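For readers on newer Spark versions, a hedged sketch of the same idea with DataFrames: SchemaRDD was renamed DataFrame in Spark 1.3, and unionAll has been deprecated since Spark 2.0 in favor of union, which behaves the same way (rows are appended, duplicates are kept, and columns are matched by position rather than by name). The object name, the `combine` helper, the column names, and the sample rows below are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameUnionSketch {
  // Hypothetical helper: append the rows of b to a and cache the result.
  // Both inputs are assumed to have the same columns in the same order,
  // because union resolves columns by position.
  def combine(a: DataFrame, b: DataFrame): DataFrame =
    a.union(b).cache()

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-union-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val df1 = Seq((1, "Aug", 30), (2, "Aug", 15)).toDF("id", "month", "days")
      val df2 = Seq((1, "Oct", 10), (2, "Oct", 5)).toDF("id", "month", "days")

      val all = combine(df1, df2)
      all.show() // prints the four combined rows
    } finally {
      spark.stop()
    }
  }
}
```

If the two DataFrames have the same column names but in a different order, unionByName (available since Spark 2.3) matches columns by name instead.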
