I want to pick out only the female employees by combining two JSON files, keep only the fields we are interested in, and write the output to another JSON file.
I am trying to implement this on Google Cloud Platform using Dataflow. Can someone please provide sample Java code that produces this result?
Employee JSON
{"emp_id":"OrgEmp#1","emp_name":"Adam","emp_dept":"OrgDept#1","emp_country":"USA","emp_gender":"female","emp_birth_year":"1980","emp_salary":"$100000"}
{"emp_id":"OrgEmp#1","emp_name":"Scott","emp_dept":"OrgDept#3","emp_country":"USA","emp_gender":"male","emp_birth_year":"1985","emp_salary":"$105000"}
Department JSON
{"dept_id":"OrgDept#1","dept_name":"Account","dept_start_year":"1950"}
{"dept_id":"OrgDept#2","dept_name":"IT","dept_start_year":"1990"}
{"dept_id":"OrgDept#3","dept_name":"HR","dept_start_year":"1950"}
The expected output JSON file should look like this:
{"emp_id":"OrgEmp#1","emp_name":"Adam","dept_name":"Account","emp_salary":"$100000"}
Someone has asked for a Java-based solution for this question. Here is the Java code for this. It's more verbose, but it does essentially the same thing.
With CoGroupByKey, you can use dept_id as a key to group both collections. In the Beam Java SDK, the grouped value for each key is a CoGbkResult.
You can do this using CoGroupByKey (where a shuffle will be used), or using side inputs if your departments collection is significantly smaller. I will give you code in Python, but you can use the same pipeline in Java.
With side inputs, you will:
Convert your departments PCollection into a dictionary that maps each dept_id to its department JSON dictionary.
Then take the employees PCollection as the main input, and use each employee's emp_dept value as the dept_id to look up the matching department JSON in that dictionary.
Like so:
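Since the question asks for Java, here is a minimal sketch of the two side-input steps above in plain Java. It is a model of the data flow only, not Beam code: the class SideInputJoinSketch and its helpers row, indexById and join are hypothetical names, rows are plain maps standing in for parsed JSON objects, and dropping employees whose department is unknown is an assumption. In a real pipeline the lookup map would be a map-valued side input and the loop body would live in a DoFn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the side-input join (hypothetical names, no Beam dependency).
class SideInputJoinSketch {

    // Builds one row from alternating key/value arguments (stand-in for a parsed JSON object).
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Step 1: turn the departments into a dept_id -> department lookup (the "side input").
    static Map<String, Map<String, String>> indexById(List<Map<String, String>> departments) {
        Map<String, Map<String, String>> byId = new LinkedHashMap<>();
        for (Map<String, String> d : departments) {
            byId.put(d.get("dept_id"), d);
        }
        return byId;
    }

    // Step 2: walk the employees (the "main input"), keep the female ones,
    // look up each department, and emit only the fields of interest.
    static List<Map<String, String>> join(List<Map<String, String>> employees,
                                          Map<String, Map<String, String>> deptById) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> e : employees) {
            if (!"female".equals(e.get("emp_gender"))) {
                continue;
            }
            Map<String, String> dept = deptById.get(e.get("emp_dept"));
            if (dept == null) {
                continue; // assumption: skip employees whose department is not in the lookup
            }
            out.add(row("emp_id", e.get("emp_id"),
                        "emp_name", e.get("emp_name"),
                        "dept_name", dept.get("dept_name"),
                        "emp_salary", e.get("emp_salary")));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> departments = Arrays.asList(
            row("dept_id", "OrgDept#1", "dept_name", "Account"),
            row("dept_id", "OrgDept#2", "dept_name", "IT"),
            row("dept_id", "OrgDept#3", "dept_name", "HR"));
        List<Map<String, String>> employees = Arrays.asList(
            row("emp_id", "OrgEmp#1", "emp_name", "Adam", "emp_dept", "OrgDept#1",
                "emp_gender", "female", "emp_salary", "$100000"),
            row("emp_id", "OrgEmp#1", "emp_name", "Scott", "emp_dept", "OrgDept#3",
                "emp_gender", "male", "emp_salary", "$105000"));
        // prints [{emp_id=OrgEmp#1, emp_name=Adam, dept_name=Account, emp_salary=$100000}]
        System.out.println(join(employees, indexById(departments)));
    }
}
```

This approach only makes sense when the whole departments lookup fits comfortably in memory on every worker, which is why it suits a small departments collection.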
With CoGroupByKey, you can use dept_id as a key to group both collections. This results in a PCollection of key-value pairs where the key is the dept_id and the value holds two iterables: the departments with that id, and the employees in that department.
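To make the shape of that grouped result concrete, here is a small plain-Java sketch. It does not use Beam: the class CoGroupJoinSketch, the Grouped holder (a hypothetical stand-in for what the Java SDK exposes as a CoGbkResult), and the helpers row, coGroup and femaleEmployees are all illustrative names, and rows are plain maps standing in for parsed JSON objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of grouping two collections by a shared key (hypothetical names).
class CoGroupJoinSketch {

    // The value for one key: every department and every employee that share that dept_id.
    static final class Grouped {
        final List<Map<String, String>> departments = new ArrayList<>();
        final List<Map<String, String>> employees = new ArrayList<>();
    }

    // Builds one row from alternating key/value arguments (stand-in for a parsed JSON object).
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Keys departments by dept_id and employees by emp_dept, then merges both under one key.
    static Map<String, Grouped> coGroup(List<Map<String, String>> departments,
                                        List<Map<String, String>> employees) {
        Map<String, Grouped> byDept = new LinkedHashMap<>();
        for (Map<String, String> d : departments) {
            byDept.computeIfAbsent(d.get("dept_id"), k -> new Grouped()).departments.add(d);
        }
        for (Map<String, String> e : employees) {
            byDept.computeIfAbsent(e.get("emp_dept"), k -> new Grouped()).employees.add(e);
        }
        return byDept;
    }

    // For each key, pairs every department with every female employee and keeps four fields.
    static List<Map<String, String>> femaleEmployees(Map<String, Grouped> byDept) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Grouped g : byDept.values()) {
            for (Map<String, String> d : g.departments) {
                for (Map<String, String> e : g.employees) {
                    if ("female".equals(e.get("emp_gender"))) {
                        out.add(row("emp_id", e.get("emp_id"),
                                    "emp_name", e.get("emp_name"),
                                    "dept_name", d.get("dept_name"),
                                    "emp_salary", e.get("emp_salary")));
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> departments = Arrays.asList(
            row("dept_id", "OrgDept#1", "dept_name", "Account"),
            row("dept_id", "OrgDept#2", "dept_name", "IT"),
            row("dept_id", "OrgDept#3", "dept_name", "HR"));
        List<Map<String, String>> employees = Arrays.asList(
            row("emp_id", "OrgEmp#1", "emp_name", "Adam", "emp_dept", "OrgDept#1",
                "emp_gender", "female", "emp_salary", "$100000"),
            row("emp_id", "OrgEmp#1", "emp_name", "Scott", "emp_dept", "OrgDept#3",
                "emp_gender", "male", "emp_salary", "$105000"));
        // prints [{emp_id=OrgEmp#1, emp_name=Adam, dept_name=Account, emp_salary=$100000}]
        System.out.println(femaleEmployees(coGroup(departments, employees)));
    }
}
```

With the sample data, the key OrgDept#2 ends up with one department and an empty employees list, so it contributes nothing to the output. Filtering on emp_gender before grouping would give the same result while sending fewer records through the grouping step.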