Question:

I have a RabbitMQ cluster with two nodes in production, and the cluster is breaking with these error messages:

=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'

=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}

I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster disconnected, and surprisingly the two nodes did not try to reconnect!

When the cluster breaks, the haproxy load balancer still marks both nodes as active and sends requests to both of them, although they are no longer in a cluster.

My questions:

If the nodes are configured to work as a cluster, why don't they try to reconnect after a network failure?
How can I identify a broken cluster and shut down one of the nodes? I have consistency problems when the two nodes work separately.

Answer 1:

One other way to recover from this kind of failure is to work with Mnesia, the database that RabbitMQ uses as its persistence mechanism. Mnesia also controls the synchronization of the RabbitMQ instances (and their master/slave status). For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html

Adding the relevant section here:

There are several occasions when Mnesia may detect that the network has been partitioned due to a communication failure.

One is when Mnesia already is up and running and the Erlang nodes gain contact again. Then Mnesia will try to contact Mnesia on the other node to see if it also thinks that the network has been partitioned for a while. If Mnesia on both nodes has logged mnesia_down entries from each other, Mnesia generates a system event, called {inconsistent_database, running_partitioned_network, Node} which is sent to Mnesia's event handler and other possible subscribers. The default event handler reports an error to the error logger.

Another occasion when Mnesia may detect that the network has been partitioned due to a communication failure, is at start-up. If Mnesia detects that both the local node and another node received mnesia_down from each other it generates a {inconsistent_database, starting_partitioned_network, Node} system event and acts as described above.

If the application detects that there has been a communication failure which may have caused an inconsistent database, it may use the function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which nodes each table may be loaded.

At start-up Mnesia's normal table load algorithm will be bypassed and the table will be loaded from one of the master nodes defined for the table, regardless of potential mnesia_down entries in the log. The Nodes may only contain nodes where the table has a replica and if it is empty, the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used when next restarting.

The function mnesia:set_master_nodes(Nodes) sets master nodes for all tables. For each table it will determine its replica nodes and invoke mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that are included in the Nodes list (i.e. TabNodes is the intersection of Nodes and the replica nodes of the table). If the intersection is empty the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used at next restart.
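
The intersection rule described above can be modeled in a few lines. This is only a Python sketch of the documented behaviour, not Mnesia code; the table names and the replica map are hypothetical and exist purely for illustration.

```python
def master_nodes_per_table(replicas, nodes):
    """Model of what mnesia:set_master_nodes(Nodes) is described as doing.

    replicas: dict mapping table name -> list of nodes holding a replica
              (hypothetical data, for illustration only).
    nodes:    the Nodes argument.

    Returns a dict mapping table name -> TabNodes, the intersection of
    Nodes and the table's replica nodes. An empty list stands for
    "master node recovery is reset for this table".
    """
    wanted = set(nodes)
    return {
        tab: [n for n in reps if n in wanted]
        for tab, reps in replicas.items()
    }


# Hypothetical tables: table_a is replicated on both nodes, table_b on one.
replicas = {
    "table_a": ["rabbit@rabbitmq01", "rabbit@rabbitmq02"],
    "table_b": ["rabbit@rabbitmq02"],
}
result = master_nodes_per_table(replicas, ["rabbit@rabbitmq01"])
```

Here table_b has no replica on the chosen node, so its entry is empty and, per the text above, it would fall back to the normal load mechanism at the next restart.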

The functions mnesia:system_info(master_node_tables) and mnesia:table_info(Tab, master_nodes) may be used to obtain information about the potential master nodes.

Determining which data to keep after communication failure is outside the scope of Mnesia. One approach would be to determine which "island" contains a majority of the nodes. Using the {majority,true} option for critical tables can be a way of ensuring that nodes that are not part of a "majority island" are not able to update those tables. Note that this constitutes a reduction in service on the minority nodes. This would be a tradeoff in favour of higher consistency guarantees.

The function mnesia:force_load_table(Tab) may be used to force load the table regardless of which table load mechanism is activated.

This is a lengthier and more involved way of recovering from such failures, but it gives finer granularity and control over the data that should be available in the final master node (this can reduce the amount of data loss that might happen when "merging" RabbitMQ masters).

Answer 2:

RabbitMQ clusters do not work well on unreliable networks (this is stated in the RabbitMQ documentation). So when a network failure happens in a two node cluster, each node thinks that it is the master and the only node in the cluster. Two master nodes don't automatically reconnect, because their states are not automatically synchronized (even in the case of a RabbitMQ slave, the actual message synchronization does not happen: the slave just "catches up" as messages get consumed from the queue and more messages get added).

To detect whether you have a broken cluster, run the command:

rabbitmqctl cluster_status

on each of the nodes that form part of the cluster. If the cluster is broken then you'll only see one node. Something like:

Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done.

In such cases, you'll need to run the following set of commands on one of the nodes that formed part of the original cluster, so that it joins the other master node (say rabbitmq1) in the cluster as a slave. Note that reset discards that node's state, so run these on the node whose data you can afford to lose:

rabbitmqctl stop_app

rabbitmqctl reset

rabbitmqctl join_cluster rabbit@rabbitmq1

rabbitmqctl start_app

Finally, check the cluster status again. This time you should see both nodes.
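
The check described above can be automated by comparing the running_nodes entry with the nodes you expect to see. The following is a minimal Python sketch that assumes the output format shown in this answer; newer RabbitMQ releases print cluster status differently, so the parser is tied to that older format. The expected node list (including rabbit@rabbitmq2) is an assumption made for the example.

```python
import re


def running_nodes(status_output):
    """Extract node names from the {running_nodes,[...]} entry."""
    match = re.search(r"\{running_nodes,\[([^\]]*)\]\}", status_output)
    if match is None:
        raise ValueError("no running_nodes entry found")
    # Erlang may quote node names with single quotes; strip them.
    return [n.strip().strip("'") for n in match.group(1).split(",") if n.strip()]


def missing_nodes(status_output, expected):
    """Return the expected nodes that are not currently running, sorted."""
    return sorted(set(expected) - set(running_nodes(status_output)))


# Output copied from the broken-cluster example in this answer.
sample = """Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done."""

# Assumed cluster membership for illustration.
expected = ["rabbit@rabbitmq1", "rabbit@rabbitmq2"]
missing = missing_nodes(sample, expected)
```

A non-empty result means the node that produced the output no longer sees the whole cluster. In practice the text would come from running rabbitmqctl cluster_status on each node.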

Note: If you have the RabbitMQ nodes in an HA configuration using a Virtual IP (and the clients are connecting to RabbitMQ using this virtual IP), then the node that should be made the master should be the one that has the Virtual IP.

Answer 3:

RabbitMQ also offers two ways to deal with network partitions automatically: pause-minority mode and autoheal mode. (The default behaviour is referred to as ignore mode).

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer than or equal to half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run.
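
The minority rule quoted above boils down to a single comparison. Here is a Python sketch of that rule as stated in the text, not RabbitMQ's actual implementation; the function and parameter names are made up for the example.

```python
def should_pause(reachable_nodes, total_nodes):
    """Return True if a node would consider itself in a minority.

    reachable_nodes: this node plus the peers it can still see.
    total_nodes:     size of the whole cluster.
    "Minority" follows the wording above: half or fewer of all nodes.
    """
    return 2 * reachable_nodes <= total_nodes


two_node_split = should_pause(1, 2)     # each side of a two node partition
three_node_major = should_pause(2, 3)   # the larger side of a 2/1 split
three_node_minor = should_pause(1, 3)   # the isolated node of a 2/1 split
```

Under this rule both halves of a partitioned two node cluster count as minorities, so both would pause, which suggests pause-minority is a poor fit for the two node setup in the question.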

In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred. It will restart all nodes that are not in the winning partition. The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
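
The selection order described above (most connected clients, then most nodes) can be expressed as a sort key. This Python sketch only illustrates that ordering and is not RabbitMQ's code; the dictionary layout and the client counts are hypothetical.

```python
def pick_winner(partitions):
    """Pick the winning partition by (client count, node count).

    partitions: list of dicts with "nodes" (list of node names) and
                "clients" (number of connected clients).
    On a full tie, max() returns the first candidate; that arbitrary
    choice stands in for the "unspecified" tie-break in the text.
    """
    return max(partitions, key=lambda p: (p["clients"], len(p["nodes"])))


# Hypothetical partitions after a split.
partitions = [
    {"nodes": ["rabbit@rabbitmq01"], "clients": 12},
    {"nodes": ["rabbit@rabbitmq02"], "clients": 30},
]
winner = pick_winner(partitions)
```

In this example rabbit@rabbitmq02 wins because it has more clients connected, so rabbit@rabbitmq01 would be the node that gets restarted.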

You can enable either mode by setting the configuration parameter cluster_partition_handling for the rabbit application in your configuration file to either pause_minority or autoheal.

Which mode should I pick?

It's important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use the federation plugin or the shovel plugin.

With that said, you might wish to pick a recovery mode as follows:

ignore: Your network really is reliable. All your nodes are in a rack, connected with a switch, and that switch is also the route to the outside world. You don't want to run any risk of any of your cluster shutting down if any other part of it fails (or you have a two node cluster).
pause_minority: Your network is maybe less reliable. You have clustered across 3 AZs in EC2, and you assume that only one AZ will fail at once. In that scenario you want the remaining two AZs to continue working and the nodes from the failed AZ to rejoin automatically and without fuss when the AZ comes back.
autoheal: Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two node cluster.

This answer is taken from the RabbitMQ docs. https://www.rabbitmq.com/partitions.html gives a more detailed description.