I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use to do the CoGroupByKey for both are the same. Is there a way to avoid reshuffling the left side of the join after the first completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second groupByKey still seems to be taking a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can ignore grouping one side, but I really feel safer not going down to that level at this point.
This was prior to keeping Iterables grouped after the first join
Have you considered accessing the smaller datasets as
View.asMap()
orView.asMultimap()
side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.