ZeroMQ doesn't auto-reconnect

Posted 2019-06-20 14:45

Question:

I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.

  1. I start the server.
  2. I start the client.
  3. Each second the client sends a message to the server, and receives a response.
  4. I press Ctrl-C to stop the server.
  5. The client tries to send its next outgoing message and gets stuck in a never-returning epoll system call (as shown by strace).
  6. I restart the server.
  7. The zmq_recv call in the client is still stuck, even when the new server has been running for a minute. The only way to make progress for the client is to kill it (with Ctrl-C) and restart it.

Q1: Is this the expected behavior? I'd expect that in a few seconds the client should figure out that the server is running again, and it would auto-reconnect.

Q2: What should I change in the example code to fix this?

Q3: Am I using the wrong version of the software, or is something broken on my system?

I've disabled the firewall, sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.

In the strace -f ./hwclient output I can see that the client tries connect() 10 times a second after the server went down (this matches the default ZMQ_RECONNECT_IVL of 100 ms). In the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client (but it does notice when I kill the client, and it receives requests from other clients started after the server restart).

Using ipc:// instead of tcp:// causes the same behavior.

The auto-reconnect happens successfully in zmq_send if the server has been killed before the client does its next zmq_send. However, when the server gets killed while the client is in zmq_recv, the zmq_recv blocks indefinitely, and the client can't seem to recover from that.

I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and it's already acting on it -- it just doesn't make zmq_recv resend the request to the new server -- or at least return early indicating an error.

Answer 1:

You may be having the same issue that ZeroMQ just fixed for me in 4.0.6 (issue 1362). Basically, the subscriber socket wouldn't always resend its filter during a reconnection (an empty filter means no messages flow from the publisher to that subscriber). The only way to recover was to restart the client application. Their fix seems to have done the job. The issue was really highlighted when using a transport (like stunnel) to tunnel the connections. Without 4.0.6, I was able to get around the issue by setting the "immediate" flag on the subscriber socket.



Answer 2:

A3: No.

A2: Do not expect a demo to be designed for fault-resilient operation.

A1: Yes.


Where to go for more details?

The best next step, IMHO, is to get a more global view. That may sound complicated for the first few things one tries to code with ZeroMQ, but if you do not want to read it step by step, at least jump to page 265 of Code Connected, Volume 1.

The fastest learning curve would be to first take a look at Fig. 60 (Republishing Updates) and Fig. 62 (HA Clone Server pair) for a possible high-availability approach, and then go back to the roots, elements, and details.