Suppose I have two types of logs that share a common field 'uid'. I want to emit an output record once a log containing the same uid has arrived from both types, like a join. Is this possible with Kafka?
Yes, absolutely. Check out Kafka Streams, specifically the DSL API. It goes something like:
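(The original snippet is not shown here, so below is a minimal sketch matching the description that follows: it assumes String-typed keys and values, that each record is keyed by the uid, and placeholder values for the application id and bootstrap server.)

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class UidJoinExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uid-join-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Both input streams must be keyed by the join key, i.e. the uid.
        KStream<String, String> foo = builder.stream("foo");
        KStream<String, String> bar = builder.stream("bar");

        // Inner join: a result is emitted only when records with the same uid
        // arrive on both streams within 1000 ms of each other.
        KStream<String, String> joined = foo.join(
                bar,
                (fooValue, barValue) -> fooValue + "," + barValue,
                JoinWindows.of(1000L));

        joined.to("buzz");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```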
This simple application consumes two input topics ("foo" and "bar"), joins them, and writes the result to the topic "buzz". Since streams are infinite, joining two streams requires a join window (1000 milliseconds above): the maximum time difference between two messages on the respective streams for them to be eligible for joining.
Here is a more complete example: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
And here is the documentation: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html. You'll find there are many different kinds of joins you can perform:

- `KStream`-`KStream` (windowed; inner, left, and outer)
- `KTable`-`KTable` (inner, left, and outer)
- `KStream`-`KTable` (inner and left)
- `KStream`-`GlobalKTable` (inner and left)
It is important to note that although the above example deterministically synchronizes the streams (if you reset and reprocess the topology, you will get the same result each time), not all join operations in Kafka Streams are deterministic. As of version 1.0.0 and earlier, approximately half are not deterministic and may depend on the order in which data is consumed from the underlying topic-partitions. Specifically, inner `KStream`-`KStream` and all `KTable`-`KTable` joins are deterministic. Other joins, like all `KStream`-`KTable` joins and left/outer `KStream`-`KStream` joins, are non-deterministic: they depend on the order in which data is consumed by the consumers. Keep this in mind if you are designing your topology to be reprocessable. If you use these non-deterministic operations, the order in which events arrive while your topology is running live will produce one result, but reprocessing the topology may produce another. Note also that operations like `KStream#merge()` do not produce deterministic results either. For more on this problem, see "Why does my Kafka Streams topology not replay/reprocess correctly?" and this mailing list post.
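To make the order dependence concrete, here is a sketch of a left join built on the same `foo` and `bar` streams from the example above (the eager-null behavior described in the comments is how releases around 1.0.0 behaved, and is the source of the non-determinism):

```java
// A KStream-KStream left join whose output can depend on arrival order.
// If the matching "bar" record for a uid has already arrived within the
// window when the "foo" record is processed, the join emits the combined
// value. If the "foo" record is processed first, the join may eagerly emit
// (fooValue, null), and then also emit the combined value once the match
// arrives, so reprocessing the same topics can yield different results.
KStream<String, String> leftJoined = foo.leftJoin(
        bar,
        (fooValue, barValue) -> fooValue + "," + barValue, // barValue may be null
        JoinWindows.of(1000L));
```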