I have a PCollection<String> in Google Cloud Dataflow and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines of each shard of output are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, then you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
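As a rough illustration of the per-shard sorting step only, here is a minimal sketch in plain Java. SortedShardWriter is a hypothetical helper, not a Dataflow SDK class, and wiring it into a custom Sink's writer is not shown. It assumes each shard is small enough to buffer in memory, which is why the input has to be sharded finely.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of the Dataflow SDK): buffers the lines of
// one shard in memory and emits them in sorted order when the shard closes.
public class SortedShardWriter {
    private final List<String> buffer = new ArrayList<>();
    private final Writer out;

    public SortedShardWriter(Writer out) {
        this.out = out;
    }

    // Called once per element of the shard; nothing is written yet.
    public void write(String line) {
        buffer.add(line);
    }

    // Sorts the buffered lines (natural String order) and writes them out,
    // one per line.
    public void close() throws IOException {
        Collections.sort(buffer);
        for (String line : buffer) {
            out.write(line);
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter shard = new StringWriter();
        SortedShardWriter writer = new SortedShardWriter(shard);
        writer.write("pear");
        writer.write("apple");
        writer.write("mango");
        writer.close();
        System.out.print(shard); // lines appear in sorted order
    }
}
```

Note that this only orders lines within a single shard; it does not give any ordering across the output files.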