I am using Google DataFlow Java SDK 2.2.0. Use case as follows:
PCollection pEmployees: employees and corresponding department name. may contain up to 10 million elements.
PCollection pDepartments: department name and number of elements to be published per department. will contain few hundred elements.
task: Collect elements from pEmployees as per the department-wise number for all departments from pDepartments. This will be a big collection (up to a few hundred thousand elements or few GBs).
We cannot user Top transform here as it would work one at a time on pEmployee, whereas we have multiple departments and that too, in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter the records where row_number > target number from pDepartments. This will require a global ranking.
Question: how can we assign rank/row numbers to the elements in a pcollection?.
This is very close to the
Sample
transform, but not quite, because it applies the same threshold to all keys when used as.perKey()
. Generally, Beam currently doesn't support per-key combines with different combine function parameters.I'd recommend to emulate it by using
CoGroupByKey
to joinpEmployees
andpDepartments
and obtain tuples (CoGbkResult
) containing department name, N = number of elements, and all employees in that department. Then simply iterate through the employees and emit the first N and discard the rest.