We're currently using Redis 2.8.4 and StackExchange.Redis (and loving it), but we don't have any protection against hardware failures etc. at the moment. I'm trying to get a solution working with a master, slaves, and sentinel monitoring, but I can't quite get there, and I'm unable to find any real pointers after searching.
So currently we have got this far:
We have 3 Redis servers and a sentinel on each node (set up by the Linux guys):

devredis01:6383 (master)
devredis02:6383 (slave)
devredis03:6383 (slave)
devredis01:26379 (sentinel)
devredis02:26379 (sentinel)
devredis03:26379 (sentinel)
I am able to connect the StackExchange client to the redis servers and write/read and verify that the data is being replicated across all redis instances using Redis Desktop Manager.
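For reference, the Redis connection itself is roughly this (a minimal sketch; the key name is just an example, and abortConnect/allowAdmin are my own choices rather than requirements):

```csharp
// Sketch: connecting the StackExchange.Redis client to all three Redis nodes.
// Hostnames match the topology above; AbortOnConnectFail = false keeps the
// multiplexer retrying in the background if a node is down at startup.
using System;
using StackExchange.Redis;

class RedisConnectDemo
{
    static void Main()
    {
        var options = new ConfigurationOptions
        {
            EndPoints = { "devredis01:6383", "devredis02:6383", "devredis03:6383" },
            AbortOnConnectFail = false,   // keep retrying instead of throwing on startup
            AllowAdmin = true             // lets us use server-level commands later
        };

        using (var redis = ConnectionMultiplexer.Connect(options))
        {
            IDatabase db = redis.GetDatabase();
            db.StringSet("test:key", "hello");            // writes go to the current master
            Console.WriteLine(db.StringGet("test:key"));  // reads it back
        }
    }
}
```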
I can also connect to the sentinel services using a different ConnectionMultiplexer, query the config, ask for master redis node, ask for slaves etc.
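The sentinel ConnectionMultiplexer is configured differently; roughly like this (a sketch; "mymaster" is an assumption, use whatever service name is in your sentinel.conf):

```csharp
// Sketch: a second ConnectionMultiplexer pointed at the sentinels only.
// CommandMap.Sentinel restricts the connection to sentinel-safe commands.
using System;
using System.Net;
using StackExchange.Redis;

class SentinelQueryDemo
{
    static void Main()
    {
        var sentinelOptions = new ConfigurationOptions
        {
            EndPoints = { "devredis01:26379", "devredis02:26379", "devredis03:26379" },
            CommandMap = CommandMap.Sentinel,
            TieBreaker = "",              // sentinels have no tie-breaker key
            AbortOnConnectFail = false
        };

        using (var sentinel = ConnectionMultiplexer.Connect(sentinelOptions))
        {
            IServer server = sentinel.GetServer("devredis01", 26379);

            // Ask this sentinel who the current master is for the monitored service.
            EndPoint master = server.SentinelGetMasterAddressByName("mymaster");
            Console.WriteLine("Current master: {0}", master);

            // List the slaves it knows about.
            foreach (var slave in server.SentinelSlaves("mymaster"))
                foreach (var pair in slave)
                    Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
        }
    }
}
```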
We can also kill the master redis node and verify that one of the slaves is promoted to master and replication to the other slave continues to work. We can observe the redis connection trying to reconnect to the master, and also if I recreate the ConnectionMultiplexer I can write/read again to the newly promoted master and read from the slave.
So far so good!
The bit I'm missing is how do you bring it all together in a production system?
Should I be getting the redis endpoints from sentinel and using 2 ConnectionMultiplexers? What exactly do I have to do to detect that a node has gone down? Can StackExchange.Redis do this for me automatically, or does it raise an event so I can reconnect my redis ConnectionMultiplexer? Should I handle the ConnectionFailed event and then reconnect so that the ConnectionMultiplexer finds out what the new master is? Presumably while I am reconnecting any attempts to write will be lost?
I hope I'm not missing something very obvious here; I'm just struggling to put it all together.
Thanks in advance!
I am including our Redis wrapper; it has changed somewhat from the original answer, for various reasons:
I rather suspect this is down to our sentinel/redis configuration more than anything else. Either way, it just wasn't perfectly reliable despite destructive testing. On top of that, the master-changed message took a long time to arrive, since we had to increase timeouts because sentinel was "too sensitive" and was calling failovers when there weren't any. I think running in a virtual environment also exacerbated the problem.
Instead of listening to subscriptions, we now simply attempt a write test every 5 seconds, and we also have a "last message received" check for pub/sub. If we encounter any problems we completely tear down the connections and rebuild them. It sounds like overkill, but it's actually pretty fast, and still faster than waiting for the master-changed message from sentinel...
This won't compile without various extension methods and other classes etc., but you get the idea.
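For the general shape, here's a stripped-down sketch of the write-test/rebuild loop (RebuildConnections and the key name are placeholders for illustration, not our actual wrapper):

```csharp
// Sketch of the periodic write test: every 5 seconds prove we can still write
// to the master; if not, dispose the multiplexer and rebuild it from scratch.
using System;
using System.Threading;
using StackExchange.Redis;

class RedisHealthChecker
{
    private ConnectionMultiplexer _redis;
    private Timer _timer;

    public void Start(ConfigurationOptions redisOptions)
    {
        _redis = ConnectionMultiplexer.Connect(redisOptions);

        _timer = new Timer(_ =>
        {
            try
            {
                // Write test: fails if the master is down or we're pointed at a slave.
                _redis.GetDatabase().StringSet("healthcheck:ping", DateTime.UtcNow.Ticks);
            }
            catch (Exception)
            {
                RebuildConnections(redisOptions);
            }
        }, null, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));
    }

    private void RebuildConnections(ConfigurationOptions redisOptions)
    {
        try { _redis.Dispose(); } catch { /* best effort */ }
        _redis = ConnectionMultiplexer.Connect(redisOptions);
    }
}
```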
I was able to spend some time last week with the Linux guys testing scenarios and working on the C# side of this implementation and am using the following approach:
I find that the client is generally working and reconfigured within about 5 seconds of losing the redis master. During this time I can't write, but I can read (since you can read off a slave). 5 seconds is OK for us since our data updates very quickly and becomes stale after a few seconds (and is subsequently overwritten).
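One way to keep reads flowing during that window is to flag them so a slave can serve them; a minimal sketch (assuming your reads can tolerate slightly stale data):

```csharp
// Sketch: reads flagged PreferSlave can be served by a slave, which is why
// reads keep working during the failover window while writes do not.
using StackExchange.Redis;

class ReadFromSlaveDemo
{
    public static RedisValue ReadWhileFailingOver(ConnectionMultiplexer redis, string key)
    {
        IDatabase db = redis.GetDatabase();
        return db.StringGet(key, CommandFlags.PreferSlave);
    }
}
```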
One thing I wasn't sure about was whether I should remove a downed redis server from the redis ConnectionMultiplexer, or let it continue to retry the connection. I decided to leave it retrying, as it comes back into the mix as a slave as soon as it comes back up. I did some performance testing with and without a connection being retried and it seemed to make little difference. Maybe someone can clarify whether this is the correct approach.
Every now and then, bringing back an instance that was previously a master did seem to cause some confusion: a few seconds after it came back up I would receive a "READONLY" exception when writing, suggesting I can't write to a slave. This was rare, but I found that my "catch-all" approach of calling Configure() 12 seconds after a connection state change caught this problem. Calling Configure() seems very cheap, and therefore calling it twice regardless of whether or not it's necessary seemed OK.
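The catch-all itself is little more than wiring the connection events to a delayed Configure() call; something like this (the 12 seconds is our own tuning, not a library requirement):

```csharp
// Sketch: when a connection state changes, wait a bit and call Configure()
// so the multiplexer re-discovers which node is currently the master.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

class ReconfigureOnStateChange
{
    public static void Attach(ConnectionMultiplexer redis)
    {
        redis.ConnectionFailed += (s, e) => ScheduleReconfigure(redis, e);
        redis.ConnectionRestored += (s, e) => ScheduleReconfigure(redis, e);
    }

    private static void ScheduleReconfigure(ConnectionMultiplexer redis, ConnectionFailedEventArgs e)
    {
        Console.WriteLine("Connection state change on {0}: {1}", e.EndPoint, e.FailureType);

        Task.Delay(TimeSpan.FromSeconds(12)).ContinueWith(_ =>
        {
            // Cheap to call even when nothing changed; fixes the transient
            // READONLY errors seen after an old master rejoins as a slave.
            redis.Configure();
        });
    }
}
```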
Now that I have slaves I have offloaded some of my data cleanup code which does key scans to the slaves, which makes me happy.
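The scans themselves go against a specific slave endpoint, roughly like this (the endpoint and pattern are examples only):

```csharp
// Sketch: run the key scan against a slave so the master isn't burdened;
// the deletes themselves still go to the master through the database API.
using StackExchange.Redis;

class SlaveScanCleanup
{
    public static void CleanUp(ConnectionMultiplexer redis)
    {
        IServer slave = redis.GetServer("devredis02", 6383);  // a known slave
        IDatabase db = redis.GetDatabase();

        foreach (RedisKey key in slave.Keys(database: 0, pattern: "stale:*", pageSize: 250))
        {
            db.KeyDelete(key, CommandFlags.FireAndForget);
        }
    }
}
```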
All in all I'm pretty satisfied; it's not perfect, but for something that should very rarely happen it's more than good enough.
I just asked this question myself, and found a similar question to both yours and mine which I believe answers it: how does our code (the client) know which server is the new master when the current master goes down?
How to tell a Client where the new Redis master is using Sentinel
Apparently you just have to subscribe and listen to events from the Sentinels. That makes sense... I just figured there was a more streamlined way.
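In StackExchange.Redis terms, that subscription is roughly the following (a sketch against a sentinel connection; it assumes the documented "+switch-master" message format of "name old-ip old-port new-ip new-port"):

```csharp
// Sketch: listen on the sentinel connection's pub/sub for +switch-master,
// which Sentinel publishes when it promotes a new master.
using System;
using StackExchange.Redis;

class SentinelSwitchMasterListener
{
    public static void Listen(ConnectionMultiplexer sentinelConnection)
    {
        ISubscriber sub = sentinelConnection.GetSubscriber();

        sub.Subscribe("+switch-master", (channel, message) =>
        {
            // Message format: "<service> <old-ip> <old-port> <new-ip> <new-port>"
            string[] parts = ((string)message).Split(' ');
            string newMaster = parts[3] + ":" + parts[4];
            Console.WriteLine("Master changed to {0} - reconnect the Redis multiplexer here.", newMaster);
        });
    }
}
```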
I read something about Twemproxy for Linux, which acts as a proxy and probably does this for you, but I was on Redis for Windows and was trying to find a Windows option. We might just move to Linux if that's the approved way it should be done.