I am trying to share the work among multiple spouts. My situation: I receive one tuple/message at a time from an external source, and I want to run multiple instances of a spout; the main intention is to share the load and improve throughput.
I can do this with a single spout, but I want to spread the load across multiple spouts. I cannot figure out the logic for spreading the load, since a message's offset is not known until a particular spout finishes consuming its part (i.e. based on the buffer size set).
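To make the intent concrete, here is a minimal sketch of the kind of load-sharing spout I have in mind, where each task claims a disjoint slice of the source's partitions. Package names assume Storm 0.9.x (backtype.storm), and readFromMyPartitions is a hypothetical placeholder, not a real API:

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class PartitionedSourceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private int taskIndex; // this task's index among all tasks of this spout
    private int numTasks;  // total number of tasks running this spout

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.taskIndex = context.getThisTaskIndex();
        this.numTasks = context.getComponentTasks(context.getThisComponentId()).size();
        // Connect only to source partitions p where p % numTasks == taskIndex,
        // so no two tasks ever read the same data or contend over a shared offset.
    }

    @Override
    public void nextTuple() {
        String msg = readFromMyPartitions(); // hypothetical partition-aware read
        if (msg != null) {
            collector.emit(new Values(msg));
        }
    }

    private String readFromMyPartitions() {
        return null; // stand-in for a real client that tracks offsets per partition
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}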
Can anyone please shed some light on how to work out the logic/algorithm?
Thanks in advance for your time.
Update in response to answers:
I have now used multiple partitions on Kafka (i.e. 5). The following is the code used:

builder.setSpout("spout", new KafkaSpout(cfg), 5);
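For context, a typical way to build the cfg used above with the storm-kafka module looks like the following; the topic name, ZooKeeper address, and consumer id here are assumptions, not my actual values:

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaTopology {
    public static void main(String[] args) {
        ZkHosts hosts = new ZkHosts("localhost:2181");             // assumed ZK address
        SpoutConfig cfg = new SpoutConfig(hosts, "mytopic",        // assumed topic (5 partitions)
                                          "/kafka", "kafka-spout"); // ZK root + consumer id
        cfg.scheme = new SchemeAsMultiScheme(new StringScheme());  // emit messages as strings

        TopologyBuilder builder = new TopologyBuilder();
        // parallelism_hint = 5 so each of the 5 partitions gets its own spout task
        builder.setSpout("spout", new KafkaSpout(cfg), 5);
    }
}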
I tested by flooding 800 MB of data into each partition; the read finished in ~22 sec.
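A load generator of that kind can be sketched as follows, using the org.apache.kafka.clients.producer API for brevity; the topic name, broker address, and message size are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FloodProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // ~800 MB per partition: 819200 messages of ~1 KB each
        String payload = new String(new char[1024]).replace('\0', 'x');
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int partition = 0; partition < 5; partition++) {
                for (int i = 0; i < 819200; i++) {
                    producer.send(new ProducerRecord<>("mytopic", partition,
                                                       null, payload));
                }
            }
        }
    }
}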
I then ran the same code with parallelism_hint = 1, i.e.:

builder.setSpout("spout", new KafkaSpout(cfg), 1);

Now it took ~23 sec, only marginally more! Why?
According to the Storm docs, the setSpout() declaration is as follows:

public SpoutDeclarer setSpout(java.lang.String id,
                              IRichSpout spout,
                              java.lang.Number parallelism_hint)
where,
parallelism_hint - is the number of tasks that should be assigned to execute this spout. Each task will run on a thread in a process somewhere around the cluster.
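Note that in Storm 0.8 and later, parallelism_hint actually sets the initial number of executors (threads); the task count defaults to the same value and can be raised separately with setNumTasks. A small sketch of the distinction (the numbers here are arbitrary, and cfg stands in for the SpoutConfig built earlier):

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;

public class ParallelismExample {
    public static void main(String[] args) {
        SpoutConfig cfg = null; // built as in the earlier snippet
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new KafkaSpout(cfg), 5) // 5 executors (threads)
               .setNumTasks(10);                          // 10 tasks shared among them

        Config conf = new Config();
        conf.setNumWorkers(2); // spread those executors across 2 worker processes
    }
}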