I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck occurs after the application randomly permutes an input matrix `N` times (default 125, but it could be more) and then runs a clustering algorithm on each permuted matrix. The runs are fully independent of one another. I've sketched the top of the pipeline below:
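(This is a minimal sketch rather than my exact code: the input matrix and the `permute_matrix`/`run_clustering` functions are simplified stand-ins for the real application.)

```python
import numpy as np
import apache_beam as beam

INPUT_MATRIX = np.random.rand(500, 20)   # placeholder for the real input matrix
N = 125                                  # number of random permutations

def permute_matrix(seed):
    """Produce one random row-permutation of the input matrix."""
    return np.random.RandomState(seed).permutation(INPUT_MATRIX)

def run_clustering(matrix):
    """Stand-in for the expensive clustering algorithm (~3 s per matrix)."""
    return matrix.mean(axis=0)           # placeholder result

with beam.Pipeline() as p:
    _ = (
        p
        | 'CreateSeeds' >> beam.Create(range(N))
        | 'PermuteMatrix' >> beam.Map(permute_matrix)
        | 'RunClustering' >> beam.Map(run_clustering)
        # ...11 more small steps follow here...
    )
```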
This processes the default 125 permutations. In the Dataflow monitoring console, only the `RunClustering` step takes an appreciable amount of time (the 11 remaining steps, not shown, total only about 11 more seconds). I ran the pipeline earlier today for just 1 permutation, and the `RunClustering` step took 3 seconds (close enough to 1/125th of the time it takes for the full 125).
I'd like the `RunClustering` step to finish in 3-4 seconds no matter what the input `N` is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly parallel computation on Google Cloud Platform, so I've spent a couple of weeks learning it and porting my code. Is my understanding correct? I've also tried throwing more machines at the problem (instead of autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but neither helps.
*Is this because of a long startup time for VMs? If so, is there a way to use quickly provisioned VMs?

Another question I have is how to cut down on pipeline startup time. It's a deal breaker if users can't get results back quickly, and the fact that the total Dataflow job takes 13–14 minutes (compared to the already excessive 6–7 minutes for the pipeline itself) is unacceptable.