I am new to Solr and am experimenting with SolrCloud - and it seems that ZooKeeper is the best way to manage high availability.
However, in our production environment we only have two servers (active-active), and I am concerned that ZooKeeper is not ideal on two servers because if either of them goes down the whole ensemble stops working. The workaround so far is to run two ZooKeeper instances on server1 and one on server2, so that at least if server2 goes down we still have quorum (but if server1 goes down, game over).
What is the best practice / recommended solution for Solr in this scenario? Can it automatically replicate/fail over with SolrCloud between 2 servers without using ZooKeeper? Or is there some way to use ZooKeeper (or another tool?) so that it is robust over 2 servers? Or do I have to go back to using the legacy-mode replication?
Thanks!
You are going to need more than 2 servers. A production ZooKeeper ensemble needs at least 3 instances, and the number of instances should always be odd:
Three ZooKeeper servers is the minimum recommended size for an
ensemble, and we also recommend that they run on separate machines.
For reliable ZooKeeper service, you should deploy ZooKeeper in a
cluster known as an ensemble. As long as a majority of the ensemble
are up, the service will be available. Because Zookeeper requires a
majority, it is best to use an odd number of machines. For example,
with four machines ZooKeeper can only handle the failure of a single
machine; if two machines fail, the remaining two machines do not
constitute a majority. However, with five machines ZooKeeper can
handle the failure of two machines.
http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html
Running 2 instances on 1 server doesn't really cut it, as losing that server will kill the cluster. SolrCloud requires ZooKeeper - you can't get around it.
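To make that concrete, here is a rough Python sketch (not anything Solr or ZooKeeper ships) that checks whether a majority of ZooKeeper instances is still up after one host is lost. The host names and the `survives_loss` helper are made up for illustration, and `server3` is a hypothetical third machine that the question does not have.

```python
from typing import Mapping


def survives_loss(layout: Mapping[str, int], lost_host: str) -> bool:
    """Return True if a strict majority of instances remains after lost_host fails.

    layout maps a host name to the number of ZooKeeper instances on that host.
    """
    total = sum(layout.values())
    remaining = total - layout[lost_host]
    return remaining >= total // 2 + 1


# The workaround from the question: two instances on server1, one on server2.
two_hosts = {"server1": 2, "server2": 1}
# Hypothetical alternative: one instance on each of three machines.
three_hosts = {"server1": 1, "server2": 1, "server3": 1}

for name, layout in (("two hosts", two_hosts), ("three hosts", three_hosts)):
    for host in layout:
        print(name, "- lose", host, "- quorum kept:", survives_loss(layout, host))
```

With the two-host layout, losing server2 keeps quorum but losing server1 does not. With one instance on each of three machines, any single host can be lost.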
Setting Up an External ZooKeeper Ensemble
Although Solr comes bundled with Apache ZooKeeper, you should consider
yourself discouraged from using this internal ZooKeeper in production,
because shutting down a redundant Solr instance will also shut down
its ZooKeeper server, which might not be quite so redundant. Because a
ZooKeeper ensemble must have a quorum of more than half its servers
running at any given time, this can be a problem.
The solution to this problem is to set up an external ZooKeeper
ensemble.
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
Generally speaking, trying to run truly distributed, large-scale processing with fewer than 3 servers is a bad idea. ZooKeeper is not unique in its requirement for at least 3 servers to support reliable operation if a server fails. In general you need a quorum of surviving servers (N/2+1 with integer division, i.e. a strict majority) to function, so you need to start with at least 3.
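The N/2+1 rule is easy to tabulate. The following is just an illustrative Python sketch of the arithmetic, with function names of my own choosing. It assumes every server is a voting member, and N/2 is taken as integer division.

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of an ensemble with n voting servers."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of servers that can fail while a majority is still up."""
    return n - quorum_size(n)


for n in range(1, 6):
    print(f"servers={n} quorum={quorum_size(n)} can_lose={tolerated_failures(n)}")
```

Two servers tolerate no failures, three and four both tolerate one, and five tolerate two, which is why going from an odd size to the next even size buys no extra fault tolerance.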