Singleton in Google Dataflow

Posted 2019-06-05 10:09

Question:

I have a Dataflow pipeline that reads messages from Pub/Sub. I need to enrich each message using a couple of APIs. I want a single instance of the API to be used for processing all records, to avoid initializing the API for every request.

I tried creating a static variable, but I still see the API being initialized many times.

How can I avoid initializing a variable multiple times in Google Dataflow?

Answer 1:

Dataflow uses multiple machines in parallel to do data analysis, so your API will have to be initialized at least once per machine.

In fact, Dataflow does not have strong guarantees on the life of these machines, so they may come and go relatively frequently.

A simple way to have your job access an external service without initializing the API for every element is to initialize it in your DoFn's @Setup method:

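import org.apache.beam.sdk.transforms.DoFn;

// The String input/output types are placeholders; use your pipeline's element types.
class APICallingDoFn extends DoFn<String, String> {
    // transient: the handle is created on the worker in @Setup, not serialized with the DoFn.
    private transient ExternalServiceHandle handle = null;

    @Setup
    public void initializeExternalAPI() {
        // Called once per DoFn instance, before it processes any elements.
        // ...
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // ... process each element -- setup will have been called
    }
}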

You need to do this because neither Beam nor Dataflow guarantees the lifetime of a DoFn instance or of a worker.
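Because a worker can create several DoFn instances, per-instance setup may still run more than once on the same machine. If you want at most one handle per worker JVM, one option is a lazily initialized static holder that each DoFn instance fetches during setup. The following is a minimal sketch of that idea, not part of the Beam API: SharedHandle, enrich and initCount are hypothetical names standing in for your own client class, and the sketch assumes the client is safe to share between threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the expensive API client; not a Beam or Dataflow class. */
final class SharedHandle {
    // Counts how many times the constructor actually ran in this JVM.
    private static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Initialization-on-demand holder: the JVM initializes Holder, and therefore
    // builds INSTANCE, only on the first call to get(). Class initialization is
    // thread-safe, so concurrent callers all see the same instance.
    private static final class Holder {
        static final SharedHandle INSTANCE = new SharedHandle();
    }

    private SharedHandle() {
        INIT_COUNT.incrementAndGet(); // the expensive API setup would go here
    }

    /** Returns the single instance shared by all callers in this JVM. */
    static SharedHandle get() {
        return Holder.INSTANCE;
    }

    /** Number of times the handle has been constructed in this JVM. */
    static int initCount() {
        return INIT_COUNT.get();
    }

    /** Placeholder for the real API call. */
    String enrich(String message) {
        return message + " [enriched]";
    }
}
```

The same holder pattern could wrap ExternalServiceHandle, with the setup method fetching the shared instance instead of building a new one. This still gives one instance per JVM rather than one for the whole job, since each worker machine runs its own process.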

Hope this helps.