At my organization, we have a number of Redis workers in place for our critical tasks. Usually, once or twice a day, our workers stop processing the queues.
The code essentially looks like this:
    while ($item = $redis->blpop(array('someQueue', 'anotherQueue'), 3600)) {
        someFunction();
    }
As you can see, the code does very little, but every once in a while the queue starts building up and the worker stops popping items from it. Setting the timeout for blpop does not help, because we presume that the problem is with the Redis client connection.
At the moment, we have set up a few listeners that alert us when the queue builds up, and then we restart the workers, but the problem still persists. We could also set a timeout on our Redis client, but that is not an ideal solution either.
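One thing worth checking in a loop like the one above is how a pop timeout is handled. As far as I know, phpredis returns an empty array when blpop times out, which is falsy in PHP, so a `while ($item = ...)` loop would end after one idle timeout. The following is a minimal sketch of a loop that keeps polling instead, assuming PHP 7.1+; `runWorker` and its three callables are hypothetical names used only for illustration.

```php
<?php
// Hypothetical worker loop. $pop is any callable that returns
// [queueName, payload] for an item, and an empty array when the blocking
// pop timed out (assumed here to be what phpredis' blPop() does).
function runWorker(callable $pop, callable $handle, callable $keepRunning): int
{
    $processed = 0;
    while ($keepRunning()) {
        $item = $pop();
        if (empty($item)) {
            // Timed out with no data: poll again instead of leaving the loop.
            continue;
        }
        [$queue, $payload] = $item;
        $handle($queue, $payload);
        $processed++;
    }
    return $processed;
}

// Possible wiring with phpredis (untested assumption, shorter timeout):
// runWorker(
//     function () use ($redis) {
//         return $redis->blPop(array('someQueue', 'anotherQueue'), 30);
//     },
//     function ($queue, $payload) { someFunction(); },
//     function () { return true; }
// );
```

Passing the pop and the handler as callables keeps the loop independent of the Redis connection, so it can be exercised with canned data.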
- Has anyone else ever faced this?
- What might be the problem?
- Are we doing something wrong?
Our question is similar to "Error in implementing message queue using redis, error in using BLPOP", but we do not get any errors. The worker just stops abruptly.
Information
Redis Server: 2.8.2
PHP Redis: phpredis
Update #1
The workers that have been running for a long time have stopped processing the queue. After running CLIENT LIST, we noticed that these workers have a high idle time compared to the rest and their flag is set to N instead of b. What might be the reason behind this?
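For context, in CLIENT LIST output the b flag marks a client that is waiting in a blocking operation, while N means no specific flag is set. A worker showing N with a high idle time is therefore connected but not currently waiting on the queue. Below is a rough sketch of filtering for such connections, assuming each row is an associative array with 'addr', 'idle', 'flags' and 'cmd' keys (which is what phpredis' `$redis->client('list')` is assumed to return); `findSuspectClients` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: given CLIENT LIST rows, return the connections that
// have been idle for more than $maxIdle seconds without being blocked in a
// pop (no 'b' in their flags).
function findSuspectClients(array $clients, int $maxIdle): array
{
    $suspects = [];
    foreach ($clients as $client) {
        $idle = (int) ($client['idle'] ?? 0);
        $flags = (string) ($client['flags'] ?? '');
        $isBlocked = strpos($flags, 'b') !== false;
        if ($idle > $maxIdle && !$isBlocked) {
            $suspects[] = $client;
        }
    }
    return $suspects;
}

// Possible usage with phpredis (assumption):
// $suspects = findSuspectClients($redis->client('list'), 3600);
```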
Update #2
The problem was with someFunction(). A piece of code was preventing the function from returning control, so the client sat idle for a long time, hence the 'N' flag in the CLIENT LIST output.
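One way to keep a stuck handler from silently stalling a worker is to put a time limit around it. This is only a sketch, assuming a CLI worker with the pcntl extension and PHP 7.1+; `runWithTimeout` is a hypothetical helper, not something from our code base. Note that the signal handler only runs once control returns to the PHP engine, so a handler stuck inside a single uninterruptible native call may not be cut short.

```php
<?php
// Hypothetical helper: run $handler and throw if it takes longer than
// $seconds. Requires the pcntl extension (CLI only) and PHP 7.1+.
function runWithTimeout(callable $handler, int $seconds)
{
    pcntl_async_signals(true);
    pcntl_signal(SIGALRM, function () use ($seconds) {
        throw new RuntimeException("handler exceeded {$seconds}s");
    });
    pcntl_alarm($seconds);      // schedule SIGALRM
    try {
        return $handler();
    } finally {
        pcntl_alarm(0);         // cancel any pending alarm
    }
}

// Possible usage inside the worker loop (30 seconds is an arbitrary limit):
// runWithTimeout(function () { someFunction(); }, 30);
```

Letting the exception propagate makes the worker fail loudly, which a supervisor can then turn into a restart.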
We had a different problem: if the application server loses connectivity with the Redis server for a moment, the Redis handle becomes invalid (by the way, we expect this; it is not a bug). Although your issue is different, the workaround we used might work for you as well:
You can do something like this:
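A minimal sketch of this kind of workaround, not necessarily the exact code used by the answer's author: retry the operation with a fresh handle whenever it throws. `withReconnect` and its parameters are hypothetical names, and the host and port in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: call $operation with a connection; if it throws,
// build a fresh connection with $connect and retry, up to $maxAttempts.
function withReconnect(callable $connect, callable $operation, int $maxAttempts = 3)
{
    $conn = $connect();
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($conn);
        } catch (Exception $e) {
            // With phpredis this would typically be a RedisException;
            // real code should catch that narrower type.
            if ($attempt >= $maxAttempts) {
                throw $e;
            }
            $conn = $connect(); // replace the stale handle before retrying
        }
    }
}

// Possible usage with phpredis (assumption):
// $item = withReconnect(
//     function () { $r = new Redis(); $r->connect('127.0.0.1', 6379); return $r; },
//     function ($r) { return $r->blPop(array('someQueue', 'anotherQueue'), 30); }
// );
```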
I suggest verifying whether there is an issue, and reporting it to the Redis project if you find something server-side. However, the following steps will help you fix the problem even if it is in some other part of your stack (which is likely, since there are no known problems similar to the one above).
Steps to check what is happening:
- Check that the queue actually contains items, using the LLEN command.
- Check with CLIENT LIST that your client is actually listed and executing a blocking pop (you'll see the command name), and check the size of its pending reply to see whether it is your client that is not actually consuming the replies it gets.
Random remarks: