How can I make Jgroups reconnect even after a long

2019-09-08 18:17发布

问题:

So We have a problem where a penetration checker being run for something like 12 hours is causing Jgroups to disconnect, the slave doesn't rejoin the cluster, split brain, some other issues that represent the lack of replication, and it doesn't recover.

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">
   <TCP bind_addr="NON_LOOPBACK"
        bind_port="${infinispan.jgroups.bindPort}"
        enable_diagnostics="false"
        thread_naming_pattern="pl"
        send_buf_size="640k"
        sock_conn_timeout="300"

        thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
        thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
        thread_pool.keep_alive_time="60000"
        thread_pool.queue_enabled="false"

        internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
        internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
        internal_thread_pool.keep_alive_time="60000"
        internal_thread_pool.queue_enabled="true"
        internal_thread_pool.queue_max_size="500"

        oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
        oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
        oob_thread_pool.keep_alive_time="60000"
        oob_thread_pool.queue_enabled="false"
   />
   <TCPPING async_discovery="true"
            initial_hosts="${infinispan.jgroups.tcpping.initialhosts}"
            port_range="1"/>
   />
   <MERGE3 min_interval="10000" 
           max_interval="30000" 
   />
   <FD_SOCK />
   <FD />
   <VERIFY_SUSPECT />
   <pbcast.NAKACK2 use_mcast_xmit="false"
                   xmit_interval="1000"
                   xmit_table_num_rows="50"
                   xmit_table_msgs_per_row="1024"
                   xmit_table_max_compaction_time="30000"
                   max_msg_batch_size="100"
                   resend_last_seqno="true"
   />
   <UNICAST3 xmit_interval="500"
             xmit_table_num_rows="50"
             xmit_table_msgs_per_row="1024"
             xmit_table_max_compaction_time="30000"
             max_msg_batch_size="100"
             conn_expiry_timeout="0"
   />
   <pbcast.STABLE stability_delay="500"
                  desired_avg_gossip="5000"
                  max_bytes="1M"
   />
   <pbcast.GMS print_local_addr="true"  join_timeout="15000"/>
   <pbcast.FLUSH />
   <FRAG2 />
</config>

versions

jgroups 3.6.13
infinispan 8.1.0, 
hibernate search 5.3

I'm wondering if we can change our jgroups configuration so that the cluster node will eventually be able to rejoin. Even after 12 hours of "attack" so that we don't have to restart the servers.

回答1:

Define disconnect for me first, please!

Regarding your stack, I have a few suggestions / questions:

  • I suggest in general to use tcp.xml from the version you use and then modify it according to your needs
  • TCPPING: does initial_hosts contain all cluster members?
  • Replace FD with FD_ALL
  • STABLE: desired_avg_gossip of 5s is a bit small; this generates more traffic than needed
  • GMS.join_timeout of 15s is quite high; this is the startup time of the first member, and it also influences discovery time
  • What do you need FLUSH for?


标签: java jgroups