Update the internal state of RDD elements

2019-09-06 02:44发布

I'm newbie in Spark and I want to update the internal state of my RDD's elements with rdd.foreach method, but it doesn't work. Here is my code example:

class Test extends Serializable{
  var foo = 0.0
  var bar = 0.0

  def updateFooBar() = {
    foo = Math.random()
    bar = Math.random()
  }
}

var testList = Array.fill(5)(new Test())
var testRDD = sc.parallelize(testList)
testRDD.foreach{ x => x.updateFooBar() }
testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) }

and the result is:

0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0

标签： apache-spark rdd

1条回答

Rolldiameter

2楼-- · 2019-09-06 03:34

RDDs are immutable by design. This design choice makes them more robust, as mutation is a common source of bugs, and it supports the "resilient" part of the RDD name (resilient distributed dataset); if a partition in a downstream RDD is lost, Spark can reconstruct it from its parents. So, it's best to think of Spark programming as construction of dataflows, even when you're not doing streaming.

On foreach, it's designed for "pure side effect" operations, like writing to disk, a database, or the console.

0人赞添加讨论(0) 举报

Update the internal state of RDD elements

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间