How can I maximize throughput for an embarrassingly-parallel task?

Published 2019-08-19 05:45

Question:

I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck occurs after randomly permuting an input matrix N times (N defaults to 125 but can be larger), when the system runs a clustering algorithm on each matrix. The runs are fully independent of one another. I've captured the top of the pipeline below (step-timing screenshot not reproduced here):

This processes the default 125 permutations. Only the RunClustering step takes an appreciable amount of time; the 11 other steps not shown total about 11 more seconds. I ran the pipeline earlier today for just one permutation, and the RunClustering step took 3 seconds (close to 1/125th of the time shown above).

I'd like the RunClustering step to finish in 3-4 seconds no matter what N is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly-parallel computation on Google Cloud Platform, so I've spent a couple of weeks learning it and porting my code. Is my understanding correct? I've also tried throwing more machines at the problem (instead of autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but neither helps.



*Is this because of a long startup time for VMs? If so, is there a way to use quickly-provisioned VMs? Another question I have is how to cut down on the pipeline startup time; it's a deal breaker if users can't get results back quickly, and a total Dataflow job time of 13–14 minutes (versus the already excessive 6–7 minutes for the pipeline itself) is unacceptable.

Answer 1:

Your pipeline is suffering from excessive fusion and ends up doing almost everything on one worker. This is also why autoscaling doesn't scale higher: it detects that it is unable to parallelize your job's code, so it prefers not to waste extra workers. It's also why manually throwing more workers at the problem doesn't help.

In general, fusion is a very important optimization. Excessive fusion, however, is a common problem that Dataflow would ideally mitigate automatically (as it automatically mitigates imbalanced sharding), but that is even harder to do, though some ideas for it are in the works.

Meanwhile, you'll need to change your code to insert a "reshuffle": a group-by-key followed by an ungroup will do, since fusion never happens across a group-by-key operation. See Preventing fusion; the question "Best way to prevent fusion in Google Dataflow?" contains some example code.