ZooKeeper timeouts in Solr with no errors in the ZooKeeper logs

Posted 2019-08-24 10:31

Question:

We are facing an issue with Solr/ZooKeeper where the ZooKeeper connection times out after 10000 ms. The error is below.

SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper <server1>:9181,<server2>:9182,<server2>:9183 within 10000 ms.
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:184)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:121)

We are not getting any errors in the ZooKeeper logs, apart from the entries below:

2018-12-19 04:35:22,305 [myid:2] - INFO  [SessionTracker:ZooKeeperServer@354] - Expiring session 0x200830234de3127, timeout of 10000ms exceeded
2018-12-19 05:35:38,304 [myid:2] - INFO  [SessionTracker:ZooKeeperServer@354] - Expiring session 0x200b4f912730086, timeout of 10000ms exceeded

During the issue the thread count goes up, and we notice the thread below on the WebLogic server.

Name: Connection evictor
State: TIMED_WAITING
Total blocked: 0  Total waited: 1
Stack trace: 
java.lang.Thread.sleep(Native Method)
org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
java.lang.Thread.run(Thread.java:748)

What could be going wrong here?

Answer 1:

In my experience, ZK timeouts have almost always been due to something on the Solr node, rather than a problem in ZK.

You don't provide all the timestamps, but the theory is that:

  1. Solr fails to send the heartbeat for some reason
  2. ZK assumes the client has gone away and closes the connection
  3. Solr tries to use the connection that ZK closed
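
To make the three steps concrete, here is a minimal sketch of the expiry rule in Java. It is a simplified model, not ZooKeeper's actual session tracker: the class name, the method name, and the heartbeat times are hypothetical, and only the 10000 ms timeout comes from the logs in the question.

```java
public class SessionExpirySketch {
    /** True when nothing has been heard from the client for at least timeoutMs. */
    static boolean isExpired(long lastHeardMs, long nowMs, long timeoutMs) {
        return nowMs - lastHeardMs >= timeoutMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 10_000; // session timeout seen in the question's logs

        // Hypothetical heartbeat arrival times: regular at first, then a
        // 13 s gap while the client is stalled (overload, GC pause, ...).
        long[] heartbeatsMs = {0, 3_000, 6_000, 19_000};

        long lastHeard = heartbeatsMs[0];
        for (int i = 1; i < heartbeatsMs.length; i++) {
            long now = heartbeatsMs[i];
            if (isExpired(lastHeard, now, timeoutMs)) {
                // Step 3: the client comes back and tries to use a session
                // that the server has already expired.
                System.out.println("heartbeat at " + now
                        + " ms arrives after expiry; session is gone");
                break;
            }
            lastHeard = now; // heartbeat accepted, session stays alive
        }
    }
}
```

The point of the model is that the server side only sees silence. It logs an INFO-level expiry, which matches the ZooKeeper log lines in the question, while the actual cause sits on the client.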

So why might the Solr node fail to send the heartbeat? The Solr node may simply have been overloaded (is the thread spike a cause or a symptom?), or it may have been stuck in a very long GC pause.
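
One rough way to check the pause theory is to run a small probe thread inside the affected JVM and see whether it ever stalls. The sketch below is an assumption-laden illustration, not part of Solr: the class name, the 100 ms interval, and the 3000 ms warning threshold are hypothetical. It uses the standard `GarbageCollectorMXBean` API, whose `getCollectionTime()` reports accumulated collection time and is only an approximation of stop-the-world pause time for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class PauseProbe {
    /** Accumulated GC time in ms across all collectors, as reported by the JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** How far a sleep overshot its requested duration; never negative. */
    static long overshootMillis(long requestedMs, long actualMs) {
        return Math.max(0, actualMs - requestedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100; // hypothetical probe interval
        final long warnMs = 3_000;   // hypothetical threshold, well under the 10000 ms timeout
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 10;

        long lastGc = totalGcMillis();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            long stallMs = overshootMillis(intervalMs, elapsedMs);
            long gcNow = totalGcMillis();
            if (stallMs >= warnMs) {
                // A large overshoot means this thread could not run on time;
                // the GC delta hints at whether collection was the reason.
                System.out.printf("stalled %d ms, GC time grew by %d ms%n",
                        stallMs, gcNow - lastGc);
            }
            lastGc = gcNow;
        }
    }
}
```

If the stalls line up with the expiry timestamps in the ZooKeeper log and the GC time grows by a similar amount, a GC pause is the likely cause. If the probe stalls without GC growth, overload or thread starvation is the more likely explanation. Enabling GC logging on the Solr JVM gives the same information with more detail.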