I want to use Kafka to "divide the work". I want to publish instances of work to a topic, and run a cloud of identical consumers to process them. As each consumer finishes its work, it will pluck the next work from the topic. Each work should only be processed once by one consumer. Processing work is expensive, so I will need many consumers running on many machines to keep up. I want the number of consumers to grow and shrink as needed (I plan to use Kubernetes for this).
I found a pattern where a unique partition is created for each consumer. This "divides the work", but the number of partitions is set when the topic is created. Furthermore, the topic must be created on the command line e.g.
bin/kafka-topics.sh --zookeeper localhost:2181 --partitions 3 --topic divide-topic --create --replication-factor 1
...
for n in range(0,3):
consumer = KafkaConsumer(
bootstrap_servers=['localhost:9092'])
partition = TopicPartition('divide-topic',n)
consumer.assign([partition])
...
I could create a unique topic for each consumer, and write my own code to assign work to those topic. That seems gross, and I still have to create topics via the command line.
A work queue with a dynamic number of parallel consumers is a common architecture. I can't be the first to need this. What is the right way to do it with Kafka?
I think, you are on right path -
Here are some steps involved -
Alternatively, you can use Kafka Streams APIS. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
Thank you Mickael for pointing me in the correct direction.
https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa,
Example code for dividing the work among 3 consumers, up to a maximum of 100:
...
The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.
For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:
When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.
When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.
In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics (like lag) allowing you to decide if you have enough of them at any time