Output sorted text file from Google Cloud Dataflow

Posted 2019-03-03 03:54

Question:

I have a PCollection<String> in Google Cloud Dataflow, and I'm writing it to text files via TextIO.Write.to:

PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));

Currently, the lines within each output shard are in arbitrary order.

Is it possible to get Dataflow to output the lines in sorted order?

Answer 1:

This is not directly supported by Dataflow.

For a bounded PCollection, if you shard your input finely enough (so that each shard is small enough to be sorted on its own), you can write sorted files with a custom Sink implementation that sorts each shard before writing it out. The TextSink implementation is a useful basic outline to start from.
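
The per-shard sorting step might look roughly like the sketch below. It is plain Java and deliberately does not use any Dataflow classes: SortingShardWriter and sortShard are hypothetical names invented for illustration, and the sketch assumes that all lines of one shard fit in memory. In a real custom Sink, the same buffer-then-sort logic would live in the code that writes a single shard.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of the Dataflow SDK): buffers every line of
 * one shard in memory and emits the lines in sorted order on close().
 */
class SortingShardWriter implements AutoCloseable {
    private final Writer out;
    private final List<String> buffer = new ArrayList<>();

    SortingShardWriter(Writer out) {
        this.out = out;
    }

    /** Buffers a line; nothing reaches the underlying writer until close(). */
    void write(String line) {
        buffer.add(line);
    }

    /** Sorts the buffered lines in natural String order and writes them out. */
    @Override
    public void close() throws IOException {
        Collections.sort(buffer);
        for (String line : buffer) {
            out.write(line);
            out.write('\n');
        }
        out.close();
    }

    /** Convenience for trying the writer out: returns the sorted shard as text. */
    static String sortShard(List<String> lines) {
        StringWriter sink = new StringWriter();
        try (SortingShardWriter writer = new SortingShardWriter(sink)) {
            for (String line : lines) {
                writer.write(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }
}
```

For example, SortingShardWriter.sortShard(Arrays.asList("b", "c", "a")) returns the text "a\nb\nc\n". Because the whole shard is buffered before anything is written, this only works when shards are small, which is why the input has to be sharded finely.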