I have built an RDD from a file, where each element of the RDD is a section of the file separated by a delimiter.
val inputRDD1: RDD[(String, Long)] = myUtilities.paragraphFile(spark, path1)
  .coalesce(100 * spark.defaultParallelism)
  .zipWithIndex() // RDD[(String, Long)]
  .filter(f => f._2 != 0)
The reason for the last operation above (filter) is to remove the first element, the one with index 0.
Is there a better way to remove the first element than checking the index value of every element, as done above?
Thanks!
One possibility is to use RDD.mapPartitionsWithIndex and to remove the first element from the iterator of the partition at index 0. This way, you only ever advance a single item on the first iterator, while all the others remain untouched. Is this more efficient? Probably. Either way, I'd test both versions to see which one performs better.
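A minimal sketch of the mapPartitionsWithIndex approach could look like the following. The object name DropFirst and the helpers dropHeadOfPartitionZero and dropFirstElement are made up for illustration; only mapPartitionsWithIndex and Iterator.drop come from Spark and the Scala standard library. It assumes the first element of the RDD really sits in partition 0, which would not hold if that partition happened to be empty.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper object, not part of Spark or of the asker's myUtilities.
object DropFirst {

  // Skip one element only for the partition with index 0;
  // iterators of all other partitions are returned untouched.
  def dropHeadOfPartitionZero[T](index: Int, iter: Iterator[T]): Iterator[T] =
    if (index == 0) iter.drop(1) else iter

  // Apply the helper to every partition. No per-element index check is needed.
  // Assumption: partition 0 is non-empty, so its head is the RDD's first element.
  def dropFirstElement[T: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.mapPartitionsWithIndex(
      (index, iter) => dropHeadOfPartitionZero(index, iter),
      preservesPartitioning = true
    )
}
```

With the pipeline from the question, this would be called on the RDD before zipWithIndex(), for example DropFirst.dropFirstElement(myUtilities.paragraphFile(spark, path1)). If the index is still needed afterwards, zipWithIndex() can be applied to the result, keeping in mind that the indices then start at the former second element.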