Spark Streaming and mutable broadcast variables

Published 2019-04-17 03:50

Question:

I found this link https://gist.github.com/BenFradet/c47c5c7247c5d5d0f076 which shows an implementation in which a Spark broadcast variable is updated. Is this a valid implementation, i.e. will the executors see the latest value of the broadcast variable?

Answer 1:

The code you are referring to uses the Broadcast.unpersist() method. The Spark API docs for Broadcast.unpersist() say: "Asynchronously delete cached copies of this broadcast on the executors. If the broadcast is used after this is called, it will need to be re-sent to each executor." There is an overloaded unpersist(blocking: Boolean) method which blocks until unpersisting has completed. So it depends on how you are using the broadcast variable in your Spark application.

Spark does not automatically re-broadcast when you mutate a broadcast variable; the driver has to resend it. The Spark documentation says you should not modify a broadcast variable (it is meant to be immutable) to avoid inconsistent processing on the executor nodes, but the unpersist() and destroy() methods are available if you want to control the broadcast variable's life cycle yourself. Please refer to the Spark JIRA https://issues.apache.org/jira/browse/SPARK-6404
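For illustration, here is a minimal sketch of the general unpersist-and-rebroadcast pattern (not the gist's exact code): the driver unpersists the old broadcast and creates a new one inside foreachRDD, so each batch's closure captures a fresh Broadcast object. The loadReferenceData helper, the socket host/port, and the batch interval are assumptions made up for the example.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RebroadcastSketch {

  // Hypothetical helper: loads the latest reference data (e.g. from a DB or file).
  def loadReferenceData(): Map[String, String] = Map("key" -> "value")

  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext("local[2]", "rebroadcast-sketch")
    val ssc = new StreamingContext(sc, Seconds(10))

    // Driver-side mutable reference to the current Broadcast object.
    var refData: Broadcast[Map[String, String]] = sc.broadcast(loadReferenceData())

    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Runs on the driver once per batch: drop executor copies and re-broadcast.
      refData.unpersist(blocking = false)
      refData = sc.broadcast(loadReferenceData())

      // Capture the new Broadcast object in a val so the task closure ships it;
      // executors processing this batch then fetch the fresh value.
      val current = refData
      rdd.map(line => current.value.getOrElse(line, "unknown")).count()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the re-broadcast happens on the driver inside foreachRDD, which is exactly why tasks of the next batch see the new value: they reference a brand-new Broadcast object rather than a mutated old one.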