Ranking PCollection elements

Published 2019-08-22 02:47

Question:

I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:

PCollection pEmployees: employees and their corresponding department names. May contain up to 10 million elements.

PCollection pDepartments: department names and the number of elements to be published per department. Will contain a few hundred elements.

Task: collect elements from pEmployees according to the per-department count, for all departments in pDepartments. The result will be a big collection (up to a few hundred thousand elements, or a few GB).

We cannot use the Top transform here, as it would work on one department at a time on pEmployees, whereas we have multiple departments, and those are themselves in a PCollection. We could assign a row number to each element of pEmployees, join it with pDepartments, and filter out the records where row_number > the target number from pDepartments. This would require a global ranking.

Question: how can we assign rank/row numbers to the elements in a PCollection?

Answer 1:

This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used as .perKey(). In general, Beam currently doesn't support per-key combines with different combine function parameters for each key.

I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments, obtaining tuples (CoGbkResult) that contain the department name, N (the number of elements to publish), and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
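
The "emit the first N and discard the rest" step can be sketched in plain Java, independent of Beam. This is only an illustration of the per-department logic: the class name FirstNPerDepartment and the helper takeFirst are hypothetical names, not part of the Beam SDK. It assumes both collections are keyed by department name before the join, and that inside a DoFn over the CoGroupByKey output you would pass it the employee Iterable obtained from CoGbkResult.getAll(...) and the count obtained from CoGbkResult.getOnly(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class; not part of the Beam SDK.
class FirstNPerDepartment {

    // Returns at most n elements from the front of the iterable.
    // It stops iterating as soon as n elements are kept, so a large
    // department is not read in full.
    static <T> List<T> takeFirst(Iterable<T> items, long n) {
        List<T> kept = new ArrayList<>();
        Iterator<T> it = items.iterator();
        while (kept.size() < n && it.hasNext()) {
            kept.add(it.next());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the employees grouped under one department key.
        List<String> employees = Arrays.asList("ann", "bob", "cy");
        // Stand-in for that department's N from pDepartments.
        long n = 2;
        System.out.println(takeFirst(employees, n)); // prints [ann, bob]
    }
}
```

Inside the DoFn you would output each kept element rather than collect them into a list. Note that the values grouped under a key come in no guaranteed order, so "the first N" here means an arbitrary N employees per department, not the top N by any ranking.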