I am using Google DataFlow Java SDK 2.2.0. Use case as follows:
PCollection pEmployees: employees and corresponding department name. may contain up to 10 million elements.
PCollection pDepartments: department name and number of elements to be published per department. will contain few hundred elements.
task: Collect elements from pEmployees as per the department-wise number for all departments from pDepartments. This will be a big collection (up to a few hundred thousand elements or few GBs).
We cannot user Top transform here as it would work one at a time on pEmployee, whereas we have multiple departments and that too, in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter the records where row_number > target number from pDepartments. This will require a global ranking.
Question: how can we assign rank/row numbers to the elements in a pcollection?.