Thread safety of SerializableFunction in cloud dat

Posted 2019-08-16 11:39

Question:

I'm implementing the SerializableFunction interface and I'd like to reuse some expensive helper objects that I create in the constructor. When this class is used in a dataflow job, is a new instance created/cloned for every thread that uses it?

Thanks, Genady

Answer 1:

Short Answer
SerializableFunction does not need to be thread-safe, since each thread gets its own deserialized instance. Anything it accesses in a shared scope (for example through static methods or static references) does need to be thread-safe.
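To illustrate the second sentence, here is a minimal sketch of a helper that is reached through a static reference and is therefore shared by every function instance in the same JVM. The names SharedLookup and SharedLookupDemo are hypothetical and not part of any SDK; the sketch only assumes that the shared state is kept in a ConcurrentHashMap so that concurrent callers are safe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical expensive helper that is shared through a static reference.
final class SharedLookup {
    // A concurrent map, because several function instances running on
    // different threads may read and write it at the same time.
    private static final Map<String, Integer> CACHE = new ConcurrentHashMap<>();

    private SharedLookup() {}

    // computeIfAbsent on ConcurrentHashMap is atomic per key, so each
    // value is computed safely even under concurrent calls.
    static int length(String key) {
        return CACHE.computeIfAbsent(key, String::length);
    }

    static int cachedKeys() {
        return CACHE.size();
    }
}

class SharedLookupDemo {
    // Calls the shared helper from several threads and returns the
    // number of distinct keys that ended up in the cache.
    static int run(int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) {
                    SharedLookup.length("key-" + (i % 10));
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("worker threads did not finish");
        }
        return SharedLookup.cachedKeys();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4)); // 10 distinct keys were cached
    }
}
```

A plain HashMap in the same position would not be safe here, which is the case the short answer warns about.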

Long Answer
The SerializableFunction is serialized using Java's object serialization mechanism and saved as part of the Dataflow specification. Depending on the specification and how it is optimized, the SerializableFunction will most likely be broken up into multiple units of work. Each worker machine may then request one or more units of work, which it processes in parallel. Each unit of work uses Java's object serialization mechanism to recreate an instance of the SerializableFunction, and each thread is assigned to only one unit of work.

Note that even though each unit of work is assigned to one thread, the expensive helper objects may still be shared among multiple instances of the same SerializableFunction on a worker if they are not part of the SerializableFunction itself and are instead reached some other way, such as through a static reference or static method.
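The round trip described above can be sketched with plain java.io serialization. This is only an illustration under stated assumptions: SerializableFn is a hypothetical stand-in for the SDK interface so that the code compiles without the Dataflow SDK, and UppercaseFn and RoundTripDemo are made-up names. One point that matters for helpers created in a constructor: Java deserialization does not run the constructor of a Serializable class again, so a helper must either be serializable itself or be marked transient and rebuilt lazily, as the sketch does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Locale;

// Hypothetical stand-in for the SDK's SerializableFunction interface,
// declared here only so the sketch compiles on its own.
interface SerializableFn<I, O> extends Serializable {
    O apply(I input);
}

class UppercaseFn implements SerializableFn<String, String> {
    private static final long serialVersionUID = 1L;

    // Per-instance helper. It is transient, so it is not written to the
    // stream and is null after deserialization.
    private transient StringBuilder helper;

    // Rebuilds the helper on first use. No locking is used, relying on the
    // answer's statement that each instance is used by a single thread.
    private StringBuilder helper() {
        if (helper == null) {
            helper = new StringBuilder(); // stands in for an expensive object
        }
        return helper;
    }

    @Override
    public String apply(String input) {
        StringBuilder sb = helper();
        sb.setLength(0);
        return sb.append(input.toUpperCase(Locale.ROOT)).toString();
    }
}

class RoundTripDemo {
    // Serializes an object to bytes and reads a fresh copy back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        UppercaseFn original = new UppercaseFn();
        UppercaseFn a = copy(original); // instance for one unit of work
        UppercaseFn b = copy(original); // instance for another unit of work
        System.out.println(a != b);     // true: two separate objects
        System.out.println(a.apply("x") + b.apply("y"));
    }
}
```

Because a and b are separate objects, each has its own helper field, so nothing is shared between them unless it is reached through a static reference.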