How to define a global Scala variable in Spark which will be shared by all workers?

Published: 2020-04-17 07:20

Question:

In a Spark program, I want to define a variable, such as an immutable map, that will be accessed by all worker programs synchronously. What can I do? Should I define a Scala object?

Not only an immutable map: what if I want a variable that can be shared and updated synchronously? For example, a mutable map, a `var Int`, a `var String`, or something else? What can I do? Is a Scala object variable OK? For example:

object SparkObj {
  var x: Int = 0
  var y: String = ""
}
  1. Are x and y maintained by the driver instead of the workers, and shared by all workers?
  2. Do x and y have only one copy instead of several copies?
  3. Are the updates to x and y synchronous?

Answer 1:

If you refer to a variable inside a closure that runs on the workers, it will be captured, serialized and sent to the workers. For example:

val i = 5
rdd.map(_ + i) // "i" is sent to the workers, they add 5 to each element.

Nothing is sent back from the workers, however. If you add something to a mutable.Seq inside a worker, the change will not be visible anywhere else. You'll be modifying a worker-local copy that is discarded after the closure is executed.
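Here is a minimal, self-contained sketch showing both behaviors; the app name, buffer, and input values are illustrative, assuming a local SparkSession:

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

object ClosureDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("closure-demo").getOrCreate()
    val sc = spark.sparkContext

    val i = 5                              // captured, serialized, sent to workers
    val buffer = mutable.ArrayBuffer.empty[Int]

    val result = sc.parallelize(1 to 4).map { x =>
      buffer += x                          // mutates a worker-side copy only
      x + i                                // reading the captured value works fine
    }.collect()

    println(result.mkString(","))          // 6,7,8,9
    println(buffer)                        // ArrayBuffer() -- the driver's copy is unchanged
    spark.stop()
  }
}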

Apache Spark provides a number of primitives for performing distributed computing. Synchronized mutable state is not one of these.
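If you need a result that all workers contribute to, the idiomatic route is to express it as transformations plus an action that returns data to the driver, rather than mutating shared state from inside closures. A hedged sketch along those lines (the pair values are illustrative):

import org.apache.spark.sql.SparkSession

object AggregateDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("aggregate-demo").getOrCreate()
    val sc = spark.sparkContext

    // Instead of updating a shared mutable map from the workers,
    // compute the per-key totals as an RDD and collect the result.
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
    val totals: Map[String, Int] = pairs.reduceByKey(_ + _).collect().toMap

    println(totals)                        // Map(a -> 4, b -> 2)
    spark.stop()
  }
}

The driver ends up with exactly one consistent copy of the result, which is what the question's `object SparkObj` vars cannot give you.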