So We have a problem where a penetration checker being run for something like 12 hours is causing Jgroups to disconnect, the slave doesn't rejoin the cluster, split brain, some other issues that represent the lack of replication, and it doesn't recover.
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">
<TCP bind_addr="NON_LOOPBACK"
bind_port="${infinispan.jgroups.bindPort}"
enable_diagnostics="false"
thread_naming_pattern="pl"
send_buf_size="640k"
sock_conn_timeout="300"
thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
thread_pool.keep_alive_time="60000"
thread_pool.queue_enabled="false"
internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
internal_thread_pool.keep_alive_time="60000"
internal_thread_pool.queue_enabled="true"
internal_thread_pool.queue_max_size="500"
oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
oob_thread_pool.keep_alive_time="60000"
oob_thread_pool.queue_enabled="false"
/>
<TCPPING async_discovery="true"
initial_hosts="${infinispan.jgroups.tcpping.initialhosts}"
port_range="1"/>
/>
<MERGE3 min_interval="10000"
max_interval="30000"
/>
<FD_SOCK />
<FD />
<VERIFY_SUSPECT />
<pbcast.NAKACK2 use_mcast_xmit="false"
xmit_interval="1000"
xmit_table_num_rows="50"
xmit_table_msgs_per_row="1024"
xmit_table_max_compaction_time="30000"
max_msg_batch_size="100"
resend_last_seqno="true"
/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="50"
xmit_table_msgs_per_row="1024"
xmit_table_max_compaction_time="30000"
max_msg_batch_size="100"
conn_expiry_timeout="0"
/>
<pbcast.STABLE stability_delay="500"
desired_avg_gossip="5000"
max_bytes="1M"
/>
<pbcast.GMS print_local_addr="true" join_timeout="15000"/>
<pbcast.FLUSH />
<FRAG2 />
</config>
versions
jgroups 3.6.13
infinispan 8.1.0,
hibernate search 5.3
I'm wondering if we can change our jgroups configuration so that the cluster node will eventually be able to rejoin. Even after 12 hours of "attack" so that we don't have to restart the servers.