Let's say I have a Kafka topic named SensorData to which two sensors, S1 and S2, are sending data (timestamp and value), each to its own partition, e.g. S1 -> P1 and S2 -> P2. I now need to aggregate the values for these two sensors separately, say by calculating the average sensor value over a 1-hour time window, and write the results to a new topic SensorData1Hour. Given this scenario:
- How can I select a specific topic partition using the KStreamBuilder#stream method?
- Is it possible to apply an aggregation function over two (or more) different partitions of the same topic?
You cannot (directly) access single partitions and you cannot (directly) apply an aggregation function over multiple partitions.
Aggregations are always done per key: http://docs.confluent.io/current/streams/developer-guide.html#stateful-transformations
- Thus, you can set a different key for each partition (here, the sensor ID) and then aggregate by key over a time window; see http://docs.confluent.io/current/streams/developer-guide.html#windowing-a-stream and the windowed-aggregation sketch after this list. The simplest way is to let each of your producers apply a key to each message right away (see the producer sketch below).
- If you want to aggregate over multiple partitions together, you first need to set a new key (e.g., using selectKey()) and assign that same key to all records you want to aggregate. If you want to aggregate over all partitions, you would use a single key value for everything -- however, keep in mind that this can quickly become a bottleneck, because all records with that key are processed by a single task.
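As a concrete illustration of the per-key, windowed approach, here is a minimal sketch against the pre-1.0 DSL (matching the KStreamBuilder from your question). It assumes the record values are plain strings of the form sensorId,timestamp,value and keeps the running aggregate as a "sum,count" string so that only built-in serdes are needed; the topic names come from your question, everything else (class name, parsing, store name) is illustrative. If your producers already set the sensor ID as the message key, you can drop the selectKey() step.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SensorAverages {

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-averages");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        final KStreamBuilder builder = new KStreamBuilder();

        // Reads the whole topic, i.e. all partitions; you cannot subscribe to P1 or P2 only.
        final KStream<String, String> input =
            builder.stream(Serdes.String(), Serdes.String(), "SensorData");

        input
            // Re-key each record by its sensor id so all readings of one sensor are
            // aggregated together, regardless of the partition they were read from.
            // (Drop this step if the producers already set the sensor id as the key.)
            .selectKey((key, value) -> value.split(",")[0])
            .groupByKey(Serdes.String(), Serdes.String())
            // 1-hour tumbling windows; the running aggregate is a "sum,count" string
            // so the sketch only needs the built-in String serde. A small SumCount
            // class plus a custom serde would be cleaner in real code.
            .aggregate(
                () -> "0.0,0",
                (sensorId, record, agg) -> {
                    final String[] sumCount = agg.split(",");
                    final double sum = Double.parseDouble(sumCount[0])
                        + Double.parseDouble(record.split(",")[2]);
                    final long count = Long.parseLong(sumCount[1]) + 1L;
                    return sum + "," + count;
                },
                TimeWindows.of(60 * 60 * 1000L),
                Serdes.String(),
                "sensor-averages-store")
            .toStream()
            // Emit "<sensorId>@<windowStart>" -> average.
            .map((windowedKey, agg) -> {
                final String[] sumCount = agg.split(",");
                final double avg = Double.parseDouble(sumCount[0]) / Long.parseLong(sumCount[1]);
                return KeyValue.pair(
                    windowedKey.key() + "@" + windowedKey.window().start(),
                    String.valueOf(avg));
            })
            .to(Serdes.String(), Serdes.String(), "SensorData1Hour");

        new KafkaStreams(builder, props).start();
    }
}
```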
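And, for completeness, a sketch of the producer side under the same assumptions (plain string serialization, sensorId,timestamp,value values), using the sensor ID as the message key:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (final Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by sensor id keeps each sensor's data in one partition (default
            // partitioner) and lets the Streams app aggregate per key directly.
            final long timestamp = System.currentTimeMillis();
            producer.send(new ProducerRecord<>("SensorData", "S1", "S1," + timestamp + ",23.4"));
        }
    }
}
```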