I'm experiencing issues scaling my app with multiple requests.
Each request sends an ask to an actor, which then spawns other actors. This works fine in isolation; however, under load (5+ asks at once) the ask takes a massive amount of time to deliver the message to the target actor. The original design was to bulkhead requests evenly, but this is now causing a bottleneck. Example:
In this picture, the ask is sent right after the query plan resolver, yet there is a multi-second gap before the actor receives the message. This only happens under load (5+ requests/sec). I first thought it was a thread-starvation issue.
Design: each planner-executor is a separate instance per request. It spawns a new 'Request Acceptor' actor each time (the acceptor logs 'requesting score' when it receives a message).
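For context, the flow is roughly this shape (a minimal sketch assuming classic Akka actors; `PlannerExecutor`, `ScoreRequest`, and `RequestScore` are placeholder names, only the 'Request Acceptor' actor and its 'requesting score' log line come from the actual design):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._

final case class ScoreRequest(payload: String)
final case class RequestScore(value: Double)

class RequestAcceptor extends Actor {
  def receive: Receive = {
    case ScoreRequest(payload) =>
      // The multi-second gap under load is between the ask below and this log line firing.
      println(s"requesting score: $payload")
      sender() ! RequestScore(1.0)
  }
}

// One planner-executor instance per request; each one spawns a fresh Request Acceptor.
class PlannerExecutor(system: ActorSystem) {
  private implicit val timeout: Timeout = Timeout(10.seconds)

  def run(payload: String): Future[RequestScore] = {
    val acceptor = system.actorOf(Props[RequestAcceptor]())
    (acceptor ? ScoreRequest(payload)).mapTo[RequestScore]
  }
}
```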
- I gave the ActorSystem a custom global executor (a large one). Even during this massive delay, the threads were never utilized beyond the core thread-pool size.
- I made sure all ExecutionContexts in the child actors used the correct execution context.
- I made sure all blocking calls inside actors were wrapped in a Future (see the blocking-call sketch after this list).
- I gave the parent actor (and all children) a custom dispatcher with core size 50 and max size 100. It never requested more threads (it stayed at 50), even during these delays (see the dispatcher config sketch after this list).
- Finally, I tried creating a completely new ActorSystem for each request (inside the planner-executor). This also had no noticeable effect!
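The dispatcher setup from that bullet looks roughly like this (a sketch; the name `request-dispatcher` and the `ScoringActor` class are placeholders, only the 50/100 pool sizes are the ones I described):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

// Placeholder actor standing in for the parent/child actors in my app.
class ScoringActor extends Actor {
  def receive: Receive = { case msg => println(s"requesting score: $msg") }
}

object DispatcherSetup extends App {
  // Custom dispatcher with core size 50 / max size 100.
  val config = ConfigFactory.parseString(
    """
    request-dispatcher {
      type = Dispatcher
      executor = "thread-pool-executor"
      thread-pool-executor {
        core-pool-size-min = 50
        core-pool-size-max = 50
        max-pool-size-min  = 100
        max-pool-size-max  = 100
      }
      throughput = 1
    }
    """).withFallback(ConfigFactory.load())

  val system = ActorSystem("planner", config)

  // The parent (and its children) are pinned to the custom dispatcher.
  val parent = system.actorOf(
    Props[ScoringActor]().withDispatcher("request-dispatcher"))
}
```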
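And the blocking-call bullet looks roughly like this (again a sketch; `blocking-io-dispatcher` and `blockingScoreCall` are placeholder names, and the dispatcher is assumed to be defined in application.conf the same way as `request-dispatcher` above):

```scala
import akka.actor.Actor
import akka.pattern.pipe
import scala.concurrent.{ExecutionContext, Future}

class BlockingCallActor extends Actor {
  // Blocking work runs on a dedicated dispatcher, not on the actor's own dispatcher.
  // Assumes a "blocking-io-dispatcher" block exists in application.conf.
  private implicit val blockingEc: ExecutionContext =
    context.system.dispatchers.lookup("blocking-io-dispatcher")

  def receive: Receive = {
    case payload: String =>
      // The blocking call itself is wrapped in a Future and the result piped back;
      // blockingScoreCall stands in for whatever the real blocking call is.
      Future(blockingScoreCall(payload)).pipeTo(sender())
  }

  private def blockingScoreCall(p: String): Double = {
    Thread.sleep(100) // simulate a blocking downstream call
    1.0
  }
}
```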
I'm a bit stumped by this. From these tests it does not look like a thread-starvation issue, so I'm back at square one: I have no idea why the message takes longer and longer to deliver the more concurrent requests I make. The Zipkin trace before reaching this point does not degrade with more requests until it reaches the ask here. Before then, the server is able to handle multiple steps, e.g. verifying the request and talking to the DB, before finally going into the planner-executor, so I doubt the application itself is running out of CPU time.