Cannot talk to ZooKeeper - Updates are disabled

Posted 2019-04-26 17:22

Question:

We are facing a peculiar issue with ZooKeeper (ZK): the connection between ZK and SolrCloud is lost all of a sudden, and updates start failing with the exception "Cannot talk to ZooKeeper - Updates are disabled."

Our application has two Solr clusters, set up separately in two different data centers. Both clusters have the same configuration and data and are expected to take the same incremental load. Application users need their changes to be reflected in search almost immediately, so we run the incremental load every 10 seconds. That said, the number of data updates within any 10-second window will not go beyond 10,000 in ideal scenarios.

Three ZooKeeper nodes are set up in a quorum on dedicated servers for each data center. With this setup, we recently encountered the issue mentioned above in one of the data centers: ZK goes down all of a sudden and fails to recover by itself. Strangely, this happened in only one data center, even though both DCs share the same load.
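For reference, this is roughly how the state of each ZK node could be captured the next time it happens. It is only a minimal sketch: it assumes the `mntr` four-letter command is enabled on the servers (on ZooKeeper 3.5 and later it has to be allowed through `4lw.commands.whitelist`) and that the client port is the default 2181. The function names and the hostnames in the usage note are hypothetical, not taken from our setup.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. "ruok", "mntr") and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn the tab-separated key/value lines of a `mntr` reply into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines without a tab, such as plain error messages
            stats[key.strip()] = value.strip()
    return stats


def describe_node(host, port=2181):
    """Return a small status dict for one node; connection errors are reported, not raised."""
    try:
        stats = parse_mntr(four_letter_word(host, port, "mntr"))
    except OSError as exc:  # includes refused connections and timeouts
        return {"host": host, "reachable": False, "error": str(exc)}
    return {
        "host": host,
        "reachable": True,
        "state": stats.get("zk_server_state"),
        "avg_latency": stats.get("zk_avg_latency"),
        "outstanding_requests": stats.get("zk_outstanding_requests"),
    }
```

Calling `describe_node` for each of the three servers (for example the placeholder names `zk1.example.com`, `zk2.example.com`, `zk3.example.com`) during an incident should show whether the nodes are still reachable and whether one of them still reports itself as leader.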

Although it did not impact searching the index, it bombarded the application team with failure notifications (because of an application-specific notification setup).

What have we done to handle this? A: To stop the flood of emails, we stopped the incremental jobs for about 5 minutes and then resumed them.

What have we observed? (Our understanding could be wrong; please correct us.) A: Stopping the jobs gave ZK some time to recover by itself, which allowed the incremental jobs to run normally when resumed. No restart of either ZK or SolrCloud was needed.
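To make the workaround concrete, the pause-and-resume step amounts to something like the sketch below. This is not our actual job code: `run_job`, `pause_seconds`, `max_pauses` and the function name are hypothetical, and matching on the exception message is just an assumption about how the failure could be detected.

```python
import time

ZK_ERROR_TEXT = "Cannot talk to ZooKeeper - Updates are disabled"


def run_incremental_with_pause(run_job, pause_seconds=300, max_pauses=3, sleep=time.sleep):
    """Run one incremental load; on the ZooKeeper error, pause and then retry.

    Returns the number of pauses that were needed. Any other exception, or the
    ZooKeeper error persisting after max_pauses pauses, is re-raised.
    """
    pauses = 0
    while True:
        try:
            run_job()
            return pauses
        except Exception as exc:
            # Only the specific ZooKeeper failure triggers a pause.
            if ZK_ERROR_TEXT not in str(exc) or pauses >= max_pauses:
                raise
            pauses += 1
            sleep(pause_seconds)  # about 5 minutes by default, as in the manual workaround
```

The `sleep` parameter is there only so the pause can be replaced in a test; in a real job the default `time.sleep` would be used.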

What would we like to know? Q: Nothing unusual was happening in terms of load on ZK at the time. What, then, could have caused ZK to shut itself down?

It would be of great help if anyone can help me understand the root cause of this unexpected behavior.

Thanks in advance!