Ranking pcollection elements

2019-08-22 03:04发布

I am using Google DataFlow Java SDK 2.2.0. Use case as follows:

PCollection pEmployees: employees and corresponding department name. may contain up to 10 million elements.

PCollection pDepartments: department name and number of elements to be published per department. will contain few hundred elements.

task: Collect elements from pEmployees as per the department-wise number for all departments from pDepartments. This will be a big collection (up to a few hundred thousand elements or few GBs).

We cannot user Top transform here as it would work one at a time on pEmployee, whereas we have multiple departments and that too, in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter the records where row_number > target number from pDepartments. This will require a global ranking.

Question: how can we assign rank/row numbers to the elements in a pcollection?.

标签： google-cloud-dataflow apache-beam

1条回答

闹够了就滚

2楼-- · 2019-08-22 03:32

This is very close to the Sample transform, but not quite, because it applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine function parameters.

I'd recommend to emulate it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing department name, N = number of elements, and all employees in that department. Then simply iterate through the employees and emit the first N and discard the rest.

0人赞添加讨论(0) 举报

Ranking pcollection elements

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间