Dataflow Pipeline Slow

Posted 2019-07-25 11:40

My Dataflow pipeline is running extremely slowly. It is processing approximately 4 elements/s with 30 worker threads. A single local machine running the same operations (but outside the Dataflow framework) can process 7 elements/s. The script is written in Python, and the data is read from BigQuery.

The workers are n1-standard machines, and all appear to be at 100% CPU utilization.

The operations contained within the combine are (a rough sketch follows the list):

  1. tokenize the record and apply stop-word filtering (nltk)
  2. stem each word (nltk)
  3. look up the word in a dictionary
  4. increment the count of that word in the dictionary
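For reference, the per-record work roughly corresponds to a CombineFn along the lines of the sketch below. This is only an illustration assuming the Apache Beam Python SDK; `WordCountFn` and the specific tokenizer/stemmer choices are my own placeholders, not the poster's actual code.

```python
# Minimal sketch of the combine described above (assumed, not the real code).
import apache_beam as beam
from nltk.corpus import stopwords        # requires the nltk 'stopwords' data
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires the nltk 'punkt' data

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

class WordCountFn(beam.CombineFn):
    """Accumulates per-word counts across records in a plain dict."""

    def create_accumulator(self):
        return {}

    def add_input(self, counts, record_text):
        # 1. tokenize the record and drop stop words
        tokens = [t.lower() for t in word_tokenize(record_text)
                  if t.lower() not in STOP_WORDS]
        # 2. stem each remaining token
        stems = [STEMMER.stem(t) for t in tokens]
        # 3./4. look up each stem and increment its count
        for stem in stems:
            counts[stem] = counts.get(stem, 0) + 1
        return counts

    def merge_accumulators(self, accumulators):
        merged = {}
        for counts in accumulators:
            for word, n in counts.items():
                merged[word] = merged.get(word, 0) + n
        return merged

    def extract_output(self, counts):
        return counts
```

Such a CombineFn would typically be applied with something like `counts = records | beam.CombineGlobally(WordCountFn())`.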

Each record is approximately 30-70 KB, and the total number of records read from BigQuery is ~9,000,000 (the log shows all records were exported successfully).

With 30 worker threads, I expect the throughput to be much higher than my single local machine's, and certainly not half as fast.

What could the problem be?

GCP Console Screenshot

1 Answer
虎瘦雄心在
Answered 2019-07-25 12:37

After some performance profiling and testing with datasets of various sizes, it appears this is a problem with the huge size of the dictionaries: the pipeline works fine for thousands of records (with throughput closer to 40 elements/s), but breaks down on millions. I'm closing this topic, as the rabbit hole goes deeper.
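To illustrate why a single-dict accumulator can become the bottleneck: in a global combine, the accumulator dictionaries grow with the full vocabulary of the dataset, and every merge has to walk and rehash that entire dictionary. One commonly suggested restructuring, shown as a hedged sketch below and not something the poster reports trying, is to emit individual stems and let Beam count them per element so no single accumulator holds the whole vocabulary.

```python
# Hypothetical restructuring (assumption, not the poster's pipeline):
# emit one stem per token and count per element instead of building one
# giant dict in a global combine.
import apache_beam as beam
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def to_stems(text):
    """Tokenize, drop stop words, and stem, yielding one stem at a time."""
    for token in word_tokenize(text):
        token = token.lower()
        if token not in STOP_WORDS:
            yield STEMMER.stem(token)

def count_words(records):
    # `records` is assumed to be a PCollection of record text strings.
    return (records
            | 'ToStems' >> beam.FlatMap(to_stems)
            | 'CountPerStem' >> beam.combiners.Count.PerElement())
```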

Since this is a problem with a specific use case, I thought it would not be relevant to continue on this thread. If you want to follow me on my adventure, the follow-up question resides here.
