My Dataflow pipeline is running extremely slowly. It's processing approximately 4 elements/s with 30 worker threads. A single local machine running the same operations (but not in the Dataflow framework) is able to process 7 elements/s. The script is written in Python, and the data is read from BigQuery.
The workers are n1-standard machines, and all appear to be at 100% CPU utilization.
The operations contained within the combine are (a rough sketch follows the list):
- tokenize the record and apply stop-word filtering (nltk)
- stem each word (nltk)
- look up the word in a dictionary
- increment the count of that word in the dictionary
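Roughly, the per-record logic looks like the sketch below. This is only an approximation of what the combine does; the function and variable names are mine rather than the actual pipeline code, and it assumes the nltk `punkt` and `stopwords` data are installed.

```python
import collections

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt') and nltk.download('stopwords') have been run.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()


def count_words(record_text, counts=None):
    """Tokenize, drop stop words, stem, and tally word frequencies in a dict."""
    counts = counts if counts is not None else collections.defaultdict(int)
    for token in word_tokenize(record_text):
        word = token.lower()
        if word in STOP_WORDS:
            continue                   # stop-word filtering
        stem = STEMMER.stem(word)      # stemming
        counts[stem] += 1              # dictionary lookup + increment
    return counts
```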
Each record is approximately 30-70 KB. The total number of records is ~9,000,000, read from BigQuery (the log shows all records were exported successfully).
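For context, the read side is just the standard BigQuery source. A minimal sketch using the Apache Beam Python SDK's `beam.io.ReadFromBigQuery` (the table reference, field name, and pipeline options are placeholders; older SDK versions expose the same read via `BigQuerySource` instead):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; in practice these select the Dataflow runner and set
# the project, region, and staging/temp locations.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    records = (
        p
        # Placeholder table reference; each row is returned as a Python dict.
        | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
            table='my-project:my_dataset.my_table')
        # Hypothetical field holding the text to tokenize.
        | 'ExtractText' >> beam.Map(lambda row: row['text'])
    )
```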
With 30 worker threads, I would expect the throughput to be much higher than my single local machine's, and certainly not half as fast.
What could the problem be?
After some performance profiling and testing with datasets of several sizes, it appears this is probably a problem with the huge size of the dictionaries: the pipeline works fine on thousands of records (with throughput closer to 40 elements/s) but breaks down on millions. I'm closing this topic, as the rabbit hole goes deeper.
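To illustrate the trend (these are not measurements from my data; the per-record word counts and vocabulary below are entirely made up), a quick local experiment shows how a word-count accumulator dictionary keeps growing as more records are folded in:

```python
import random
import sys

def simulate_counts(num_records, words_per_record=100, vocab_size=2_000_000):
    """Fold synthetic records into one word-count dict, as the combine does."""
    counts = {}
    for _ in range(num_records):
        for _ in range(words_per_record):
            word = 'w%d' % random.randrange(vocab_size)
            counts[word] = counts.get(word, 0) + 1
    return counts

for n in (1_000, 10_000, 100_000):
    counts = simulate_counts(n)
    print('%d records -> %d distinct keys, ~%d KB for the hash table alone '
          '(key/value objects not included)'
          % (n, len(counts), sys.getsizeof(counts) // 1024))
```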
Since this is a problem with a specific use case, I thought it would not be relevant to continue on this thread. If you want to follow me on my adventure, the follow-up questions reside here.