I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.
- I start the server.
- I start the client.
- Each second the client sends a message to the server, and receives a response.
- I press Ctrl-C to stop the server.
- The client tries to send its next outgoing message and gets stuck in a never-returning epoll system call (as shown by strace).
- I restart the server.
- The zmq_recv call in the client is still stuck, even after the new server has been running for a minute. The only way for the client to make progress is to kill it (with Ctrl-C) and restart it.
Q1: Is this the expected behavior? I'd expect that in a few seconds the client should figure out that the server is running again, and it would auto-reconnect.
Q2: What should I change in the example code to fix this?
Q3: Am I using the wrong version of the software, or is something broken on my system?
I've disabled the firewall; sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.
In the strace -f ./hwclient output I can see that the client is trying to connect() 10 times a second (the default value of ZMQ_RECONNECT_IVL) after the server went down. In the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client (but it notices when I kill the client; the server also receives requests from other clients which have been started after the server restart).
Using ipc:// instead of tcp:// causes the same behavior.
The auto-reconnect happens successfully in zmq_send if the server has been killed before the client does the next zmq_send. However, when the server gets killed while the client is running zmq_recv, then zmq_recv blocks indefinitely, and the client can't seem to recover from that.
I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and it's already acting on it; it just doesn't make zmq_recv resend the request to the new server, or at least return early indicating an error.