Kafka Streaming Concurrency?

I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

回答1:

How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.

This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste this here verbatim, but I want to highlight that IMHO the key element to understand is that of partitions (cf. Kafka's topic partitions, which in Kafka Streams is generalized to "stream partitions" as not all data streams that are being processed will be going through Kafka) because a partition is currently what determines the parallelism of both Kafka (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).

If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.

Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But...

If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.

...because Kafka allows a topic to have many partitions, you get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.

Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?

Kafka's stream processing engine is definitely recommended and also actually being used in practice for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See LINE engineer's blog: Applying Kafka Streams for internal message delivery pipeline for an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), where they describe how they are using Kafka and the Kafka Streams API in production to process millions of events per second.

回答2:

The kstreams config num.stream.threads allows you to override the number of threads from 1. However, it may be preferable to simply run multiple instances of your streaming app, with all of them running the same consumer group. That way you can spin up as many instances as you need to get optimal partitioning.