Remove first element in RDD without using filter f

Posted 2019-07-03 23:20

I have built an RDD from a file, where each element of the RDD is a section of the file, separated by a delimiter.

val inputRDD1: RDD[(String, Long)] = myUtilities.paragraphFile(spark, path1)
                                                .coalesce(100 * spark.defaultParallelism)
                                                .zipWithIndex()         // RDD[(String, Long)]
                                                .filter(f => f._2 != 0) // drop the element with index 0

The final operation above (filter) is there only to remove the first element, the one with index 0.

Is there a better way to remove the first element than checking the index of every element, as done above?

Thanks!

Tags: scala apache-spark rdd

1 answer

孤傲高冷的网名

Reply #2 · 2019-07-04 00:24

One possibility is to use RDD.mapPartitionsWithIndex and drop the first element from the iterator of the partition at index 0:

val inputRDD = myUtilities
                .paragraphFile(spark, path1)
                .coalesce(100 * spark.defaultParallelism)
                .mapPartitionsWithIndex(
                  (index, it) => if (index == 0) it.drop(1) else it,
                  preservesPartitioning = true
                )

This way, only the iterator of the first partition is advanced, and only by a single item; all other partitions are passed through untouched. Is this more efficient? Probably. Either way, I'd test both versions to see which one performs better.
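
To try the two versions side by side, a minimal local sketch might look like the following. It is not the original pipeline: myUtilities.paragraphFile is replaced by a small in-memory RDD, and the names DropFirstElement, dropFirst and dropFirstViaFilter, the sample data and the local[2] master are all assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

// Hypothetical helper object, not part of the original question or answer.
object DropFirstElement {

  // Answer's approach: drop one item from partition 0 only;
  // every other partition's iterator is returned unchanged.
  def dropFirst[T: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.mapPartitionsWithIndex(
      (index, it) => if (index == 0) it.drop(1) else it,
      preservesPartitioning = true
    )

  // Question's approach: index every element, then filter out index 0.
  def dropFirstViaFilter[T: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.zipWithIndex().filter(_._2 != 0).map(_._1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run with two threads
      .appName("drop-first-element")
      .getOrCreate()
    try {
      // Stand-in for myUtilities.paragraphFile(spark, path1): four made-up sections in two partitions.
      val paragraphs = spark.sparkContext.parallelize(Seq("header", "p1", "p2", "p3"), numSlices = 2)

      // Both lines should print the same remaining sections.
      println(dropFirst(paragraphs).collect().mkString(", "))
      println(dropFirstViaFilter(paragraphs).collect().mkString(", "))
    } finally {
      spark.stop()
    }
  }
}
```

Two points are worth keeping in mind when comparing them. zipWithIndex has to run an extra Spark job to count the elements of each partition whenever the RDD has more than one partition, which the mapPartitionsWithIndex version avoids. On the other hand, the two versions only agree when partition 0 is non-empty: if it is empty, it.drop(1) removes nothing, whereas zipWithIndex assigns index 0 to the first element of the next non-empty partition.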
