Write streaming data to GCS using Apache Beam

2019-07-28 07:42发布

问题:

How to write messages received from PubSub to a text file in GCS using TextIO in Apache Beam? Saw some methods like withWindowedWrites() and withFilenamePolicy() but couldn't find any example of it in the documentation.

回答1:

Here is an example provided you are using the Java SDK (BEAM 2.1.0).

PipelineOptions options = PipelineOptionsFactory.fromArgs(args)
                                                    .withValidation()
                                                    .as(PipelineOptions.class);

Pipeline pipeline = Pipeline.create(options);

pipeline.begin()
               .apply("PubsubIO",PubsubIO.readStrings()
                     .withTimestampAttribute("timestamp")
                     .fromSubscription("projects/YOUR-PROJECT/subscriptions/YOUR-SUBSCRIPTION"))
               .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(30L))))
               .apply(TextIO.write().to("gs://YOUR-BUCKET").withWindowedWrites());

You can see the defaults that the SDK uses for the file naming by exploring the "expand" method in TextIO.Write.expand(PCollection input). Specifically I'd take a look at DefaultFilenamePolicy.java