Reading from Pubsub using Dataflow Java SDK 2

Published 2019-07-09 00:53

Question:

Much of the Google Cloud Platform documentation for the Java SDK 2.x tells you to reference the Apache Beam documentation.

When reading from Pub/Sub using Dataflow, should I still be doing PubsubIO.Read.named("name").topic("")?

Or should I be doing something else?

Building off of that, is there a way to print the Pub/Sub data received by the Dataflow job to standard output or to a file?

Answer 1:

For Apache Beam 2.2.0, you can define the following transform to pull messages from a Pub/Sub subscription:

PubsubIO.readMessages().fromSubscription("subscription_name")

This is one way to define a transform that pulls messages from Pub/Sub. The PubsubIO class contains several other read methods, each with slightly different functionality; see the PubsubIO documentation.
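To illustrate the difference between those read methods, here is a hedged sketch assuming the Beam 2.2.0 PubsubIO API; the project, topic, and subscription names are placeholders, and `pipeline` is an already-created Pipeline:

```java
// Sketch of the main PubsubIO read variants (resource names are placeholders).

// readStrings(): each message payload is decoded as a UTF-8 String.
PCollection<String> strings = pipeline.apply(
    PubsubIO.readStrings()
        .fromSubscription("projects/my_project/subscriptions/my_sub"));

// readMessages(): yields PubsubMessage objects (raw payload bytes).
PCollection<PubsubMessage> messages = pipeline.apply(
    PubsubIO.readMessages()
        .fromTopic("projects/my_project/topics/my_topic"));

// readMessagesWithAttributes(): PubsubMessage including its attribute map.
PCollection<PubsubMessage> withAttrs = pipeline.apply(
    PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my_project/subscriptions/my_sub"));
```

Note that reading `fromTopic(...)` causes Dataflow to create a temporary subscription for the job, while `fromSubscription(...)` reuses one you manage yourself.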

You can write the Pub/Sub messages to a file using the TextIO class; see the examples in the TextIO documentation. For writing Pub/Sub messages to stdout, see the Logging Pipeline Messages documentation.
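One subtlety worth spelling out: a Pub/Sub source is unbounded, so a file-based sink like TextIO needs the collection to be windowed first. A minimal sketch, assuming Beam 2.2.0 and placeholder subscription/bucket names and a one-minute window:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubToText {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Read UTF-8 payloads from a subscription (placeholder name).
        .apply(PubsubIO.readStrings()
            .fromSubscription("projects/my_project/subscriptions/my_sub"))
        // The unbounded stream must be windowed before a file-based write.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // Write one output shard per window (placeholder bucket path).
        .apply(TextIO.write()
            .to("gs://my_bucket/pubsub_output")
            .withWindowedWrites()
            .withNumShards(1));

    pipeline.run().waitUntilFinish();
  }
}
```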



Answer 2:

Adding to what Andrew wrote above: code to read strings from PubsubIO and write them to stdout (just for debugging) is below. That said, I will file an internal bug to improve the JavaDoc for PubsubIO; I agree the current documentation is minimal.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.joda.time.Instant;

public static void main(String[] args) {

  Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  pipeline
    .apply("ReadStringsFromPubsub",
       PubsubIO.readStrings().fromTopic("/topics/my_project/my_topic"))
    .apply("PrintToStdout", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Debug log: print each element with the time it was processed.
        System.out.printf("Received at %s : %s%n", Instant.now(), c.element());
      }
    }));

  pipeline.run().waitUntilFinish();
}