We have an app that has users; each user uses our app for something like 10-40 minutes per go, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I'd like to count these higher-level events per day, but that's a separate question)
For this I've been looking into session windows, but all the docs seem geared towards global session windows, whereas I'd like to create them per user (which is also a natural partitioning).
I'm having trouble finding docs (python preferred) on how to do this. Could you point me in the right direction?
Or in other words: How do I create per-user per-session windows that can output more structured (enriched) events?
What I have
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s", x['jsonPayload']['value'], x['timestamp'], window)
        yield element

def sum_by_event_type(user_session_events):
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d

# ...

by_user = valid \
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))

session_gap = 5 * 60  # [s]; 5 minutes

user_sessions = by_user \
    | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
    | 'debug_printer' >> beam.ParDo(DebugPrinter()) \
    | beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
As you can see, the Sessions window doesn't expand the window, but only groups very close events together... What am I doing wrong?
You can get it to work by adding a GroupByKey transform after the windowing. You have assigned keys to the records but haven't actually grouped them together by key, so session windowing (which works per key) does not know that these events need to be merged together.
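To make the explanation above concrete, here is a minimal plain-Python model of per-key session merging. It is only a sketch of the idea, not Beam's implementation: the function names and the sample events are hypothetical. Window assignment gives every element its own [t, t + gap) window, and those windows are only merged once the elements of one key are brought together, which is why a printer placed before the grouping step still sees the narrow per-element windows.

```python
from collections import defaultdict


def assign_proto_windows(events, gap):
    """Window assignment: each (key, timestamp) gets its own [t, t + gap) window."""
    return [(key, (ts, ts + gap)) for key, ts in events]


def merge_per_key(keyed_windows):
    """Grouping step: overlapping windows that share a key are merged."""
    by_key = defaultdict(list)
    for key, win in keyed_windows:
        by_key[key].append(win)

    merged = {}
    for key, wins in by_key.items():
        sessions = []
        for start, end in sorted(wins):
            if sessions and start < sessions[-1][1]:
                # Overlaps the current session: extend its end if needed.
                sessions[-1] = (sessions[-1][0], max(sessions[-1][1], end))
            else:
                # Gap exceeded: a new session starts here.
                sessions.append((start, end))
        merged[key] = sessions
    return merged


# Hypothetical sample: (user id, event time in seconds).
events = [("user_a", 0), ("user_a", 3), ("user_b", 1), ("user_a", 20)]
proto = assign_proto_windows(events, gap=5)
sessions = merge_per_key(proto)
print(proto)
print(sessions)
```

In this model the two early user_a events end up in one session, the late one starts a new session, and user_b is never mixed with user_a.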
To confirm this I did a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and be able to test it more quickly). All five events will have the same key or user_id, but they will "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. As I use a session_gap of 5 seconds I expect the first 4 elements to be merged into the same session. The 5th event will take 8 seconds after the 4th one so it has to be relegated to the next session (gap over 5s). Data is created like this:

We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs thanks to the user_id field, we window into window.Sessions and we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:

where DebugPrinter is:

If we test this without grouping by key we get the same behavior:
But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.
Full code here
Two additional things worth mentioning. The first one is that, even when running this locally on a single machine with the DirectRunner, records can arrive out of order (event_3 is processed before event_2 in my case). This is done on purpose to simulate distributed processing as documented here.
The last one is that if you get a stack trace like this:
downgrade from 2.10.0/2.11.0 SDK to 2.9.0. See this answer for example.