I'm having trouble understanding if the joins in Apache Beam (e.g. http://www.waitingforcode.com/apache-beam/joins-apache-beam/read) can join entire rows.
For example:
I have 2 datasets, in CSV format, where the first rows are column headers.
The first:
a,b,c,d
1,2,3,4
5,6,7,8
1,2,5,4
The second:
c,d,e,f
3,4,9,10
I want to left join on columns c and d so that I end up with:
a,b,c,d,e,f
1,2,3,4,9,10
5,6,7,8,,
1,2,5,4,,
However all the documentation on Apache Beam seems to say the PCollection objects need to be of type KV<K, V>
when joining, so I have broken down my PCollection objects to a collection of KV<String, String>
objects (where the key is the column header, and the value is row value). But in that case (where you just have a key with a value) I don't see how the row format can be maintained. How would KV(c,7) "know" that KV(a,5) is from the same row? Is Join meant for this sort of thing at all?
My code so far:
PCollection<KV<String, String>> flightOutput = ...;
PCollection<KV<String, String>> arrivalWeatherDataForJoin = ...;
PCollection<KV<String, KV<String, String>>> output = Join.leftOuterJoin(flightOutput, arrivalWeatherDataForJoin, "");