Suppose I have two types of logs that share a common field 'uid'. I want to emit an output record once a log containing the same uid has arrived from both types, like a join. Is this possible with Kafka?
Yes, absolutely. Check out Kafka Streams, specifically the DSL API. It goes something like:
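(The original snippet is not shown here, so below is a minimal sketch matching the description that follows: it assumes String-typed keys and values, that each record is keyed by the uid, and placeholder values for the application id and bootstrap server.)

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class UidJoinExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uid-join-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Both input streams must be keyed by the join key, i.e. the uid.
        KStream<String, String> foo = builder.stream("foo");
        KStream<String, String> bar = builder.stream("bar");

        // Inner join: a result is emitted only when records with the same uid
        // arrive on both streams within 1000 ms of each other.
        KStream<String, String> joined = foo.join(
                bar,
                (fooValue, barValue) -> fooValue + "," + barValue,
                JoinWindows.of(1000L));

        joined.to("buzz");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```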
This simple application consumes two input topics ("foo" and "bar"), joins them, and writes the result to the topic "buzz". Since streams are infinite, joining two streams requires a join window (1000 milliseconds above): the maximum time difference between two messages on the respective streams for them to be eligible for joining.
Here is a more complete example: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/main/java/io/confluent/examples/streams/PageViewRegionLambdaExample.java
And here is the documentation: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html. You'll find there are many different kinds of joins you can perform:

- `KStream`-`KStream` (windowed; inner, left, and outer)
- `KTable`-`KTable` (inner, left, and outer)
- `KStream`-`KTable` (inner and left)
- `KStream`-`GlobalKTable` (inner and left)
It is important to note that although the above example deterministically synchronizes the streams (if you reset and reprocess the topology, you will get the same result each time), not all join operations in Kafka Streams are deterministic. As of version 1.0.0 and earlier, approximately half are not deterministic and may depend on the order in which data is consumed from the underlying topic-partitions. Specifically, inner `KStream`-`KStream` and all `KTable`-`KTable` joins are deterministic. Other joins, like all `KStream`-`KTable` joins and left/outer `KStream`-`KStream` joins, are non-deterministic: they depend on the order in which data is consumed by the consumers. Keep this in mind if you are designing your topology to be reprocessable. If you use these non-deterministic operations, the order in which events arrive while your topology is running live will produce one result, but reprocessing the topology may produce another. Note also that operations like `KStream#merge()` do not produce deterministic results either. For more on this problem, see "Why does my Kafka Streams topology not replay/reprocess correctly?" and this mailing list post.
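To make the order dependence concrete, here is a sketch of a left join built on the same `foo` and `bar` streams from the example above (the eager-null behavior described in the comments is how releases around 1.0.0 behaved, and is the source of the non-determinism):

```java
// A KStream-KStream left join whose output can depend on arrival order.
// If the matching "bar" record for a uid has already arrived within the
// window when the "foo" record is processed, the join emits the combined
// value. If the "foo" record is processed first, the join may eagerly emit
// (fooValue, null), and then also emit the combined value once the match
// arrives, so reprocessing the same topics can yield different results.
KStream<String, String> leftJoined = foo.leftJoin(
        bar,
        (fooValue, barValue) -> fooValue + "," + barValue, // barValue may be null
        JoinWindows.of(1000L));
```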