JBoss Ehcache Replication Exception (sender not fo

So, I just set up two JBoss nodes behind Apache, enabled clustering, and set up Ehcache synchronization. With both nodes running, I get the following exception on the node that did not receive the request:

...
JBoss_5_1_0_GA date=200905221634)] Started in 2m:16s:391ms
12:52:51,139 ERROR [NAKACK] sender 10.166.17.53:7600 not found in xmit_table
12:52:51,139 ERROR [NAKACK] range is null
12:52:51,145 INFO  [RPCManagerImpl] Received new cluster view: MergeView::[10.166.17.52:7600|1] [10.166.17.52:7600, 10.166.17.53:7600], subgroups=[[10.166.17.52:7600|0] [10.166.17.52:7600], [10.166.17.53:7600|0] [10.166.17.53:7600]]
12:53:10,006 WARN  [NAKACK] 10.166.17.52:7600] discarded message from non-member 10.166.17.53:7600, my view is [10.166.17.52:7600|0] [10.166.17.52:7600]
12:53:10,108 WARN  [NAKACK] 10.166.17.52:7600] discarded message from non-member 10.166.17.53:7600, my view is [10.166.17.52:7600|0] [10.166.17.52:7600]
12:53:10,110 ERROR [NAKACK] sender 10.166.17.53:7600 not found in xmit_table
12:53:10,110 ERROR [NAKACK] range is null
12:53:10,113 INFO  [graCluster] New cluster view for partition graCluster (id: 1, delta: 1) : [127.0.0.1:1099, 127.0.0.1:1099]
12:53:10,117 INFO  [graCluster] Merging partitions...
12:53:10,118 INFO  [graCluster] Dead members: 0
12:53:10,120 INFO  [graCluster] Originating groups: [[10.166.17.52:7600|0] [10.166.17.52:7600], [10.166.17.53:7600|0] [10.166.17.53:7600]]

This is what my ehcache.xml looks like:

<cacheManagerPeerProviderFactory
       class="net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory"
       properties="connect=TCP(start_port=7800):TCPPING(initial_hosts=10.46.49.52[7800],10.46.49.53[7800];port_range=10;timeout=3000;
                    num_initial_members=2;up_thread=true;down_thread=true):
                    VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
                    pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
                    pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;
                    print_local_addr=false;down_thread=true;up_thread=true)"
                    propertySeparator="::"/>

Finally, this is how I run the two nodes:

./run.sh -c all -g myCluster -Djboss.default.jgroups.stack=tcp -Djgroups.tcpping.initial_hosts=10.166.17.52[7600],10.166.17.53[7600] -Djboss.messaging.ServicePeerId=1 -Djgroups.bind_addr=10.166.17.52 -Djboss.node.name=node1 -b 0.0.0.0

and

./run.sh -c all -g myCluster -Djboss.default.jgroups.stack=tcp -Djgroups.tcpping.initial_hosts=10.166.17.52[7600],10.166.17.53[7600] -Djboss.messaging.ServicePeerId=2 -Djgroups.bind_addr=10.166.17.53 -Djboss.node.name=node2 -b 0.0.0.0

The servers are trying to talk to each other, but I am not sure whether they are even in the same cluster. Any help will be much appreciated.

Answer 1:

I turned on Ehcache logging and found that although the nodes attempted to talk to each other, they could not establish a connection. Fixing a badly configured hosts file resolved this. Once the nodes could talk to each other, Ehcache replication worked. Apparently the error about xmit_table was inconsequential.
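
For anyone who wants to check their own hosts file: a minimal sketch, assuming the problem is that the machine's host name resolves to a loopback address. That is only a guess, prompted by the partition view in the question's log listing both members as 127.0.0.1:1099. The class name HostResolutionCheck and the helper isLoopback are made up for illustration; only standard java.net calls are used.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic helper; not part of JBoss, JGroups, or Ehcache.
final class HostResolutionCheck {

    // Returns true when the given host name or IP literal resolves to a loopback address.
    static boolean isLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve " + host, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Ask the JVM how this machine's own host name resolves.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("Local host name: " + local.getHostName());
        System.out.println("Resolves to:     " + local.getHostAddress());
        if (local.isLoopbackAddress()) {
            // Assumption: a loopback result usually points at a hosts file entry
            // that maps the machine name to 127.0.0.1.
            System.out.println("WARNING: host name resolves to loopback; check the hosts file.");
        }
    }
}
```

Run it on each node. If it prints a loopback address, the hosts file entry for the machine name is worth a look before touching the JGroups or Ehcache configuration.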

Answer 2:

I ran into this problem recently while doing a POC on TCP-based discovery and replication of Ehcache across Windows machines. Running two instances of the service locally worked fine with an IP address as the bind address (-Djgroups.bind_addr=), but it failed when connecting across machines. We don't have access to alter the hosts file, so we changed the bind address to use the machine name instead of the IP. After restarting the services, communication across machines worked perfectly, with all CRUD operations on the cache replicated as expected.
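
Before changing the bind address, it may help to confirm whether the peer's port is reachable at all, by machine name and by IP. A rough sketch using a plain TCP connect follows. The class name PeerConnectCheck and the helper canConnect are hypothetical, and the default host and port (10.166.17.53, 7600) are simply borrowed from the question, so substitute your own values.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper; not part of JBoss, JGroups, or Ehcache.
final class PeerConnectCheck {

    // Tries a plain TCP connect; returns false on timeout, refusal, or an unresolvable host.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are placeholders taken from the question; pass your own host and port.
        String host = args.length > 0 ? args[0] : "10.166.17.53";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7600;
        boolean ok = canConnect(host, port, 3000);
        System.out.println(host + ":" + port + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```

Run it once with the peer's machine name and once with its IP while the other node is up. If only one of the two connects, that suggests which form to use for the bind address. A successful connect only shows that the port is open, not that the cluster has formed.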