Update the internal state of RDD elements

Published 2019-09-06 03:25

Question:

I'm a newbie in Spark and I want to update the internal state of my RDD's elements with the rdd.foreach method, but it doesn't work. Here is my code example:

class Test extends Serializable{
  var foo = 0.0
  var bar = 0.0

  def updateFooBar() = {
    foo = Math.random()
    bar = Math.random()
  }
}

var testList = Array.fill(5)(new Test())
var testRDD = sc.parallelize(testList)
testRDD.foreach{ x => x.updateFooBar() }
testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) }

and the result is:

0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0

Answer 1:

RDDs are immutable by design. This choice makes them more robust, since mutation is a common source of bugs, and it underpins the "resilient" part of the name (Resilient Distributed Dataset): if a partition of a downstream RDD is lost, Spark can reconstruct it from its parents. In your example, foreach runs inside tasks on copies of the Test objects, so the mutations are simply discarded; collect() re-evaluates the RDD from its original, unmodified data, which is why you see all zeros. It's therefore best to think of Spark programming as constructing dataflows that derive new RDDs from old ones, even when you're not doing streaming.
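For example, instead of mutating each element in place, you can map the RDD to a new one that carries the values you want. A minimal sketch, assuming an immutable case-class variant of Test (the names below are illustrative, not taken from your code):

case class TestState(foo: Double, bar: Double)

val initialRDD = sc.parallelize(Array.fill(5)(TestState(0.0, 0.0)))

// map builds a new RDD; nothing in initialRDD is modified
val updatedRDD = initialRDD.map(_ => TestState(Math.random(), Math.random()))

updatedRDD.collect().foreach { t => println(t.foo + "~" + t.bar) }

Each transformation produces a new RDD, and the lineage from initialRDD to updatedRDD is exactly what lets Spark recompute lost partitions.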

As for foreach, it's designed for "pure side effect" operations, such as writing to disk, to a database, or to the console.
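For instance, a sketch of a legitimate foreach use, reusing updatedRDD from the sketch above (note that in cluster mode the println output appears in the executor logs, not the driver console):

// foreach returns Unit and produces no new RDD; it is only useful for
// the side effect itself, e.g. logging or writing to an external store
updatedRDD.foreach { t =>
  println(t.foo + "~" + t.bar)
}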