Is the Iterable created by GroupByKey ordered?

Posted 2019-08-09 11:54

Question:

I.e., if my window is Window.into(new GlobalWindows()).triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(0))).accumulatingFiredPanes();

After I group by key, the next step in the pipeline receives an Iterable every time a new element enters the window for that key. Can I reliably say that the first or last element of that Iterable is the element that just entered the window?

We have a stream of forum comments coming in, potentially out of order, and we want as output a list of the number of comments a topic had at each point in time when a comment was made. If a comment comes in late, we need to reissue all of the previously issued states of the topic that follow this comment, as their counts are now off by one.

I.e., input:

topic_id, event_time
1, 1
1, 2
1, 3
1, 4
1, 0 // out of order
1, 5

output:

topic_id, state_time, num_comments
1, 1, 1 // in order, issue states accumulating as they came in
1, 2, 2
1, 3, 3
1, 4, 4
1, 0, 1 // got out of order event, need to reissue everything after it
1, 1, 2 // reissue
1, 2, 3 // reissue
1, 3, 4 // reissue
1, 4, 5 // reissue
1, 5, 6 // back to normal processing

The example is contrived; in reality, the output represented by "num_comments" is the result of reasonably complicated logic that needs to see all of the data that existed for a topic at that time.
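To make the reissue rule concrete, here is a minimal sketch in plain Java (no Beam involved) of the counting logic for a single topic. The names TopicStates, State, and onComment are hypothetical and chosen only for illustration, and the sketch assumes that event times within a topic are distinct. Each new comment yields a state for its own event time plus a reissued state for every later event time already seen. By this counting, once the comments at times 0 through 4 have been seen, a comment at time 5 yields a count of 6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Beam: tracks one topic's comments.
class TopicStates {
    /** One issued state: at stateTime the topic had numComments comments. */
    static final class State {
        final long stateTime;
        final int numComments;

        State(long stateTime, int numComments) {
            this.stateTime = stateTime;
            this.numComments = numComments;
        }

        @Override
        public String toString() {
            return "(" + stateTime + ", " + numComments + ")";
        }
    }

    // Event times seen so far, kept sorted. Assumption: event times are distinct.
    private final TreeSet<Long> seen = new TreeSet<>();

    /** Records a comment and returns the states to issue or reissue, in event-time order. */
    List<State> onComment(long eventTime) {
        seen.add(eventTime);
        // Number of comments strictly earlier than the new one.
        int count = seen.headSet(eventTime, false).size();
        List<State> out = new ArrayList<>();
        // The new comment and every later one each get a (re)issued state.
        for (long t : seen.tailSet(eventTime, true)) {
            count++;
            out.add(new State(t, count));
        }
        return out;
    }
}
```

An in-order comment produces exactly one state, while a late comment produces one state for itself and one reissue for each later comment, so nothing before the late comment is re-sent.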

Obviously one option would just be to reissue all states for every event. But that would increase the amount of data a fair bit.

Answer 1:

No, the Iterable<V> in the PCollection<KV<K, Iterable<V>>> returned by GroupByKey has no ordering guarantees.
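Because the iteration order carries no meaning, a downstream step that needs event-time order has to impose it itself. A minimal sketch, using a plain Java Iterable<Long> of event times as a stand-in for the grouped values (the class and method names are hypothetical, and this assumes all values for one key fit in memory):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Beam.
class GroupedValues {
    /** Copies the values into a list and sorts them; the input's iteration order is never relied on. */
    static List<Long> sortedByEventTime(Iterable<Long> eventTimes) {
        List<Long> sorted = new ArrayList<>();
        for (Long t : eventTimes) {
            sorted.add(t);
        }
        Collections.sort(sorted);
        return sorted;
    }
}
```

Note that sorting restores the event-time order only. It does not reveal which element caused the pane to fire, so that information would have to be carried some other way.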

Could you elaborate in the question on what you're trying to achieve and why you need the ordering? We've found that in nearly all cases when people needed sorting in GBK, there was an alternative way to achieve their goal.