I am trying to implement a Reshuffle
transform to prevent excessive fusion, but I don't know how to alter the version for <KV<String,String>>
to deal with simple PCollections. (How to reshuffle PCollection <KV<String,String>>
is described here.)
How would I expand the official Avro I/O example code to reshuffle before adding more steps in my pipeline?
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
p.apply(AvroIO.Read.named("ReadFromAvro")
.from("gs://my_bucket/path/records-*.avro")
.withSchema(schema));
Thanks to the code snippet provided by the Google support team I figured it out:
To get a reshuffled PCollection:
The Repartition class used: