Aggregation over a specific partition in Apache Ka

Posted 2019-07-19 14:30

Question:

Let's say I have a Kafka topic named SensorData to which two sensors, S1 and S2, send data (timestamp and value) on two different partitions, e.g. S1 -> P1 and S2 -> P2. Now I need to aggregate the values for these two sensors separately, say by calculating the average sensor value over a time window of 1 hour and writing it to a new topic, SensorData1Hour. Given this scenario:

  1. How can I select a specific topic partition using the KStreamBuilder#stream method?
  2. Is it possible to apply an aggregation function over two (or more) different partitions of the same topic?

Answer 1:

You cannot (directly) access single partitions and you cannot (directly) apply an aggregation function over multiple partitions.

Aggregations are always done per key: http://docs.confluent.io/current/streams/developer-guide.html#stateful-transformations

  1. Thus, you could use a different key for each partition and then aggregate by key. See http://docs.confluent.io/current/streams/developer-guide.html#windowing-a-stream

The simplest way is to let each of your producers apply a key to each message right away.
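To make "aggregate by key over a window" concrete, here is a minimal in-memory sketch in plain Java. It is not Kafka Streams code: it only imitates what a per-key, 1-hour tumbling-window average computes, assuming each message is keyed by its sensor id. The names `SensorWindowAverages`, `Reading` and `hourlyAverages` are made up for illustration, and timestamps are assumed to be non-negative epoch milliseconds.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SensorWindowAverages {
    /** Hypothetical record shape: sensor id (used as the message key), timestamp, value. */
    static final class Reading {
        final String sensorId;
        final long timestampMs;
        final double value;

        Reading(String sensorId, long timestampMs, double value) {
            this.sensorId = sensorId;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    static final long WINDOW_MS = 60L * 60L * 1000L; // 1 hour

    /** Groups by (key, window start) and returns "key@windowStartMs" -> average value. */
    static Map<String, Double> hourlyAverages(List<Reading> readings) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Reading r : readings) {
            // Align the timestamp to the start of its 1-hour window
            // (assumes non-negative timestamps).
            long windowStart = (r.timestampMs / WINDOW_MS) * WINDOW_MS;
            String groupKey = r.sensorId + "@" + windowStart;
            double[] acc = sumAndCount.computeIfAbsent(groupKey, k -> new double[2]);
            acc[0] += r.value; // running sum
            acc[1] += 1;       // running count
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
            new Reading("S1", 0L, 10.0),
            new Reading("S1", 1_800_000L, 20.0),
            new Reading("S2", 60_000L, 5.0),
            new Reading("S1", 3_600_000L, 40.0));
        System.out.println(hourlyAverages(readings));
    }
}
```

Because the grouping is driven purely by the key, it does not matter which partition a record arrived on. That is why keying each message by sensor at the producer is enough to get one average per sensor per hour.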

  2. If you want to aggregate across multiple partitions, you first need to set a new key (e.g., using selectKey()) that is the same for all the data you want to aggregate together. To aggregate all partitions, you would use a single key value; keep in mind, however, that this might quickly become a bottleneck.
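The effect of re-keying can be sketched the same way. The snippet below is again plain Java rather than the Kafka Streams API: `rekey` merely plays the role that selectKey() plays in a real topology, and `SingleKeyRekey`, `Keyed` and `averageByKey` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class SingleKeyRekey {
    /** Hypothetical key/value pair standing in for a keyed stream record. */
    static final class Keyed {
        final String key;
        final double value;

        Keyed(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Replaces every record's key with the result of newKey; values are unchanged. */
    static List<Keyed> rekey(List<Keyed> records, Function<Keyed, String> newKey) {
        List<Keyed> out = new ArrayList<>();
        for (Keyed r : records) {
            out.add(new Keyed(newKey.apply(r), r.value));
        }
        return out;
    }

    /** Computes one average per distinct key. */
    static Map<String, Double> averageByKey(List<Keyed> records) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Keyed r : records) {
            double[] acc = sumAndCount.computeIfAbsent(r.key, k -> new double[2]);
            acc[0] += r.value; // running sum
            acc[1] += 1;       // running count
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Keyed> records = List.of(
            new Keyed("S1", 10.0), new Keyed("S1", 20.0), new Keyed("S2", 30.0));
        System.out.println(averageByKey(records));                      // one entry per sensor
        System.out.println(averageByKey(rekey(records, r -> "all")));   // a single combined entry
    }
}
```

With the constant key there is only one group left, so every record has to pass through the same aggregation. In a real deployment that means all the data is routed to one partition and handled by one task, which is where the bottleneck comes from.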