I would like to run a large cluster of nodes in the cloud (AWS, Heroku, or maybe self-managed VMs) whose clocks must stay synchronized within a predefined tolerance. I'm looking for a tolerance of maybe 200 ms: if I have 250 nodes, the largest clock difference between any two of those 250 nodes should never exceed 200 ms. I don't really care about the actual date/time with respect to the rest of the world. The solution has to be fault tolerant and should not rely on the accuracy of the clock of any one system -- in fact, it's likely that none of the clocks will be terribly accurate.
The requirement is strong enough that if clock synchronization is ever determined to be unreliable for a particular node, I'd rather drop that node from the cluster than risk desynchronization -- so on any suspected failure, I'd like to be able to perform some kind of controlled shutdown of that node.
I'd love to use something like NTP, but according to the NTP known issues twiki:
> NTP was not designed to run inside of a virtual machine. It requires a high resolution system clock, with response times to clock interrupts that are serviced with a high level of accuracy. No known virtual machine is capable of meeting these requirements.
And although the same twiki goes on to describe various ways of addressing the situation (such as running ntp on the host OS), I don't believe I'll be able to modify the environment enough on AWS or Heroku to apply those workarounds.
Even if I were not running in VMs, a trusted operations manager with years of experience running ntp tells me that ntp can and will drop synchronization (or just plain get the time wrong) due to bad local clock drift every once in a while. It doesn't happen often, but it does happen, and the more machines you run, the better the odds of it happening somewhere. AFAIK, detecting how far off you are requires stopping ntpd, running a query-mode command, and starting it back up again, and it can take a long time to get an answer back.
To sum up -- I need a clock synchronization solution whose primary goals are as follows:
- Runs well in VMs where operational control is limited (i.e., "cloud service providers")
- Time tolerance across the cluster of around 200 ms between any two participants
- Ability to detect a bad node and react to it actively (see the sketch after this list)
- Fault tolerant (no single point of failure)
- Scalable (the thing can't fall over when you add more nodes -- definitely avoid n^2)
- Could support hundreds of nodes
- None of the nodes should be considered to have a superior notion of time over any other node
- It's OK for the entire cluster to drift (within reason) -- as long as it drifts in unison
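To make the "detect and react" requirement concrete: roughly, each node should watch its own estimated offset and take itself out of the cluster when that offset can't be trusted. A minimal sketch of what I have in mind (Python; `estimate_offset_ms` and `controlled_shutdown` are hypothetical placeholders for whatever the sync protocol and my application actually provide):

```python
import sys
import time

TOLERANCE_MS = 200                    # largest allowed pairwise clock difference
PER_NODE_BUDGET_MS = TOLERANCE_MS / 2 # if every node stays within +/-100 ms of a common
                                      # reference, any pair stays within 200 ms of each other

def estimate_offset_ms():
    """Hypothetical: return this node's estimated offset (in ms) from the
    cluster's agreed-upon reference time, however the sync protocol computes it."""
    raise NotImplementedError

def controlled_shutdown(reason):
    """Hypothetical: stop accepting work, drain, and leave the cluster."""
    print(f"leaving cluster: {reason}", file=sys.stderr)
    sys.exit(1)

def clock_watchdog(poll_seconds=5):
    """Prefer dropping out of the cluster over running with an untrusted clock."""
    while True:
        try:
            offset = estimate_offset_ms()
        except Exception as exc:
            controlled_shutdown(f"clock sync unreliable: {exc}")
        if abs(offset) > PER_NODE_BUDGET_MS:
            controlled_shutdown(f"offset {offset:.1f} ms exceeds per-node budget")
        time.sleep(poll_seconds)
```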
From the description, it seems like the Berkeley algorithm might be the right choice here, but is there an existing implementation of it?
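For reference, the core of a Berkeley-style averaging round is small enough that I could imagine building it myself if nothing exists. A rough sketch under my own assumptions (Python; `ask_node_for_time` and `send_adjustment` are hypothetical RPC stubs, not an existing library):

```python
import time
from statistics import fmean

OUTLIER_MS = 1000  # ignore readings wildly off from the rest

def ask_node_for_time(node):
    """Hypothetical RPC: returns the node's current clock reading in ms."""
    raise NotImplementedError

def send_adjustment(node, delta_ms):
    """Hypothetical RPC: tells the node to slew its clock by delta_ms."""
    raise NotImplementedError

def berkeley_round(coordinator, nodes):
    """One round: poll everyone, average the plausible offsets, push corrections."""
    # Measure each node's offset relative to the coordinator, compensating
    # for half the round-trip time of the request.
    offsets = {coordinator: 0.0}
    for node in nodes:
        t0 = time.monotonic()
        remote_ms = ask_node_for_time(node)
        rtt_ms = (time.monotonic() - t0) * 1000
        local_ms = time.time() * 1000
        offsets[node] = remote_ms + rtt_ms / 2 - local_ms

    # Average the plausible offsets; the coordinator's clock counts as just
    # one vote, not an authority, so the whole cluster drifts in unison.
    usable = [o for o in offsets.values() if abs(o) < OUTLIER_MS]
    target = fmean(usable)

    # Each node is told only the correction it should apply to itself.
    for node, offset in offsets.items():
        send_adjustment(node, target - offset)
```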
Nice to haves:
- Minimal configuration (nodes auto register to participate) -- important for spinning up new nodes
- HTML dashboard or (REST?) API that reports which nodes are participating in the clock synchronization and what their relative time offsets are (see the example at the end of this post)
- Pretty graphs?
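For that API nice-to-have, this is roughly the shape of report I picture it returning (purely illustrative field names, sketched as a Python dict):

```python
# Purely illustrative -- the kind of status report I'd want, e.g. from
# something like GET /cluster/clock-status (hypothetical endpoint).
example_status = {
    "tolerance_ms": 200,
    "nodes": [
        {"id": "node-001", "offset_ms": -12.4, "in_tolerance": True},
        {"id": "node-002", "offset_ms": 37.1, "in_tolerance": True},
        {"id": "node-003", "offset_ms": 142.9, "in_tolerance": False},
    ],
}
```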