Question:
Suppose I have two types of logs that share a common field 'uid', and I want to emit output once a log containing that uid has arrived from both sources, like a join. Is this possible with Kafka?

Answer 1:
Yes, absolutely. Check out Kafka Streams, specifically the DSL API. It goes something like:
```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<byte[], Foo> fooStream = builder.stream("foo");
KStream<byte[], Bar> barStream = builder.stream("bar");

fooStream.join(barStream,
        (foo, bar) -> {
            foo.baz = bar.baz;
            return foo;
        },
        JoinWindows.of(1000))
    .to("buzz");
```
This simple application consumes two input topics ("foo" and "bar"), joins them, and writes the result to topic "buzz". Since streams are infinite, when joining two streams you need to specify a join window (1000 milliseconds above): the maximum time difference allowed between two messages on the respective streams for them to be eligible for joining.
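To make the window rule concrete, here is a library-free Java sketch of the semantics: two records join when their keys match and their timestamps lie within the window. The `Event` record and `windowedJoin` helper are illustrative only, not part of Kafka's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative record type: (key, value, timestampMs). Not a Kafka class.
record Event(String key, String value, long timestampMs) {}

public class JoinWindowSketch {
    // Inner-join two event lists: emit a pair when the keys match and the
    // timestamps differ by at most windowMs — the rule JoinWindows encodes.
    static List<String> windowedJoin(List<Event> left, List<Event> right, long windowMs) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                if (l.key().equals(r.key())
                        && Math.abs(l.timestampMs() - r.timestampMs()) <= windowMs) {
                    out.add(l.key() + ":" + l.value() + "+" + r.value());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> foo = List.of(new Event("uid1", "f1", 0), new Event("uid2", "f2", 0));
        List<Event> bar = List.of(new Event("uid1", "b1", 500),   // within the 1000 ms window
                                  new Event("uid2", "b2", 5000)); // too far apart, dropped
        System.out.println(windowedJoin(foo, bar, 1000)); // → [uid1:f1+b1]
    }
}
```

In the real Kafka Streams runtime the same rule is applied incrementally per record with windowed state stores, rather than over buffered lists as in this sketch.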
Here is a more complete example: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
And here is the documentation: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html. You'll find there are many different kinds of joins you can perform:
- https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/kstream/KStream.html
- https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/kstream/KTable.html
It is important to note that, although the above example deterministically synchronizes the two streams (if you reset and reprocess the topology, you get the same result each time), not all join operations in Kafka Streams are deterministic. As of version 1.0.0, roughly half are not, and may depend on the order in which data is consumed from the underlying topic-partitions. Specifically, inner KStream-KStream joins and all KTable-KTable joins are deterministic. Other joins, such as all KStream-KTable joins and left/outer KStream-KStream joins, are non-deterministic and depend on the order in which the consumers receive data. Keep this in mind if you are designing your topology to be reprocessable: with these non-deterministic operations, the live run will produce one result from the order of events as they arrive, but reprocessing the same topology may produce another. Note also that operations like KStream#merge() do not produce deterministic results either. For more on this problem, see "Why does my Kafka Streams topology does not replay/reprocess correctly?" and this mailing list post.
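The order dependence can be seen in a small library-free sketch of how a streaming left join behaves per record: when a left record arrives before its right match, the processor may emit (left, null); arriving after, it emits the joined pair. The `process` helper below is illustrative, not Kafka's implementation, and window bookkeeping is omitted since only arrival order matters here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftJoinOrderSketch {
    // Simplified per-record left join: when a left record arrives, join it
    // with any right value seen so far, else emit (left, null). Right
    // records are only buffered. Each arrival is [side, key, value].
    static List<String> process(List<String[]> arrivals) {
        Map<String, String> rightSeen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] rec : arrivals) {
            if (rec[0].equals("R")) {
                rightSeen.put(rec[1], rec[2]);
            } else {
                out.add(rec[1] + ":" + rec[2] + "+" + rightSeen.getOrDefault(rec[1], "null"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The same two records, in two different interleavings:
        List<String> a = process(List.of(
                new String[]{"L", "uid1", "f1"}, new String[]{"R", "uid1", "b1"}));
        List<String> b = process(List.of(
                new String[]{"R", "uid1", "b1"}, new String[]{"L", "uid1", "f1"}));
        System.out.println(a); // → [uid1:f1+null]
        System.out.println(b); // → [uid1:f1+b1]  — same data, different result
    }
}
```

This is exactly why reprocessing a topology built on left/outer KStream-KStream joins can diverge from the live run: the broker makes no ordering guarantee across partitions, so the interleaving seen on replay may differ.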