To summarize, what I'm looking to do with Apache Beam on Google Dataflow is something like LAG in Azure Stream Analytics.
Using a window of X minutes where I'm receiving data:
||||||  ||||||  ||||||  ||||||  ||||||  ||||||
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |  | 6  |
|id=x|  |id=x|  |id=x|  |id=x|  |id=x|  |id=x|
||||||  ||||||  ||||||  ||||||  ||||||  ||||||  ...
I need to compare data(n) with data(n-1); following the previous example, it would be something like this:
if data(6) inside and data(5) outside then ...
if data(5) inside and data(4) outside then ...
if data(4) inside and data(3) outside then ...
if data(3) inside and data(2) outside then ...
if data(2) inside and data(1) outside then ...
Is there any "practical" way to do this?
With Beam, as explained in the docs, state is maintained per key and window. Therefore, you can't access values from previous windows.
To do what you want you might need a more complex pipeline design. My idea, developed here as an example, would be to duplicate your messages in a ParDo:

- emit each element once with its original timestamp, so it falls into its own window;
- emit a second copy with its timestamp shifted one window forward, so it is also available in the next window.
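A minimal sketch of that idea (the original snippet isn't reproduced here, so the class name DuplicateWithLagDoFn and the tag name lag_output are placeholders of my own):

```python
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms.window import TimestampedValue

WINDOW_SECONDS = 10  # fixed window length; must match the windowing applied later


class DuplicateWithLagDoFn(beam.DoFn):
    """Emits every element twice: once as-is and once shifted one window ahead."""

    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        # Copy that stays in its own window (main output).
        yield element
        # Copy shifted forward by one window length, sent to a tagged output,
        # so the same value is also visible in the *next* window.
        yield pvalue.TaggedOutput(
            'lag_output',
            TimestampedValue(element, timestamp + WINDOW_SECONDS))
```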
To do the second bullet point, we add the duration of a window (WINDOW_SECONDS) to the element timestamp, as in the sketch above. We then call the function specifying the correct tags:
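Something along these lines, assuming the input elements are (id, value) key-value pairs read from Pub/Sub (the step and variable names are placeholders):

```python
# 'messages' is the input PCollection of (id, value) pairs.
outputs = (
    messages
    | 'DuplicateWithLag' >> beam.ParDo(DuplicateWithLagDoFn()).with_outputs(
        'lag_output', main='main_output'))

old_values = outputs.lag_output    # copies shifted into the next window
new_values = outputs.main_output   # copies with their original timestamp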
and then apply the same windowing scheme to both, co-group by key, etc.
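For instance, with fixed windows of WINDOW_SECONDS (again a sketch rather than the original code):

```python
from apache_beam.transforms.window import FixedWindows

windowed_old = old_values | 'WindowOld' >> beam.WindowInto(FixedWindows(WINDOW_SECONDS))
windowed_new = new_values | 'WindowNew' >> beam.WindowInto(FixedWindows(WINDOW_SECONDS))

# For each key and window, gather the lagged (previous-window) copies
# together with the values that originally belong to that window.
joined = (
    {'old': windowed_old, 'new': windowed_new}
    | 'CoGroupByKey' >> beam.CoGroupByKey())
```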
Finally, we can have both values (old and new) inside the same ParDo:
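A possible comparison DoFn could look like this; the inside field is just a stand-in for whatever condition you check in your data, and it assumes the values are dicts:

```python
class CompareDoFn(beam.DoFn):
    """Compares the current value with the one carried over from the previous window."""

    def process(self, element):
        key, grouped = element
        old = grouped['old']  # value(s) duplicated from the previous window
        new = grouped['new']  # value(s) belonging to the current window
        if old and new:
            # Example rule from the question: data(n) inside and data(n-1) outside.
            if new[0].get('inside') and not old[0].get('inside'):
                yield (key, 'entered the area in this window')


results = joined | 'Compare' >> beam.ParDo(CompareDoFn())
```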
To test this I ran the pipeline with the direct runner and, in a separate shell, published two messages more than 10 seconds apart (in my case WINDOW_SECONDS was 10 s). The job output then showed the expected difference between the old and new values.
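If you want to reproduce a similar test, publishing the two messages could look roughly like this (the topic name and payloads are placeholders, not the ones from my run):

```bash
gcloud pubsub topics publish my-topic --message '{"id": "x", "inside": false}'
sleep 12   # wait longer than WINDOW_SECONDS so the messages land in consecutive windows
gcloud pubsub topics publish my-topic --message '{"id": "x", "inside": true}'
```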
Full code for my example is here. Take performance considerations into account, since you are duplicating elements, but it makes sense if you need values to be available during two windows.