I would like to run a large cluster of nodes in the cloud (AWS, Heroku, or maybe self-managed VMs), whose clocks must be synchronized within a predefined tolerance. I'm looking for a tolerance of about 200 ms. That means if I have 250 nodes, the largest clock difference between any two of the 250 nodes should never exceed 200 ms. I don't really care about the actual date / time with respect to the world. The solution has to be fault tolerant, and should not need to rely on the accuracy of the clock of any one system -- in fact, it's likely that none of the clocks will be terribly accurate.
The requirement is strong enough that if clock synchronization is determined to be unreliable for any particular node, I'd prefer to drop that node from the cluster -- so on any suspected failure, I'd like to be able to perform some type of controlled shutdown of that node.
I'd love to use something like NTP, but according to the NTP known issues twiki:
NTP was not designed to run inside of a virtual machine. It requires a high resolution system clock, with response times to clock interrupts that are serviced with a high level of accuracy. No known virtual machine is capable of meeting these requirements.
And although the same twiki then goes on to describe various ways of addressing the situation (such as running ntp on the host OS), I don't believe I'll have the ability to modify the environment enough on AWS or Heroku to comply with the workarounds.
Even if I were not running in VMs, a trusted operations manager who has years of experience running ntp tells me that ntp can and will drop synchronization (or plain get the time wrong) due to bad local clock drift every once in a while. It doesn't happen often, but it does happen, and as you add machines, you increase the chances of it happening. AFAIK, detecting how far off you are requires stopping ntpd, running a query mode command, and starting it back up again, and it can take a long time to get an answer back.
To sum up -- I need a clock synchronization solution whose primary goals are as follows:
- Runs well in VMs where operational control is limited (i.e. "cloud service providers")
- Time tolerances in the cluster at around 200ms between all participants
- Ability to detect a bad node and react to it in an active way
- Fault tolerant (no single point of failure)
- Scalable (the thing can't fall over when you add more nodes -- definitely avoid n^2)
- Could support hundreds of nodes
- No node should be considered to have a superior notion of time over any other node
- It's OK for the entire cluster to drift (within reason) -- as long as it drifts in unison
From the description, it seems like the Berkeley Algorithm might be the right choice here, but is it already implemented?
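To make concrete what I mean by a Berkeley-style round, here is a minimal sketch of how I understand it, not an existing implementation. The function name `berkeley_round`, the offset dictionary, and the use of the median to reject outliers are my own assumptions for illustration. In a real deployment the coordinator would have to be elected and replaceable, otherwise it becomes the single point of failure I want to avoid.

```python
from statistics import median_low

def berkeley_round(offsets, tolerance=0.2):
    """One averaging round of a Berkeley-style sync (illustrative only).

    offsets maps node id -> clock offset in seconds as measured by the
    coordinator (the coordinator's own entry is 0.0).
    Returns (adjustments, suspects).
    """
    # median_low always returns one of the measured values, so at least
    # one node survives the filter below.
    mid = median_low(offsets.values())
    # Hypothetical fault rule: anything further than `tolerance` from the
    # median is treated as a bad clock and left out of the average.
    good = {n: o for n, o in offsets.items() if abs(o - mid) <= tolerance}
    suspects = sorted(set(offsets) - set(good))
    # Every healthy node is steered toward the average of the healthy
    # nodes, so no single clock is authoritative.
    target = sum(good.values()) / len(good)
    adjustments = {n: target - o for n, o in good.items()}
    return adjustments, suspects
```

With offsets of 0.0, +0.05, -0.05 and +3.0 seconds, the last node would be reported as a suspect (a candidate for controlled shutdown) and the other three would be nudged toward their common average.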
Nice to haves:
- Minimal configuration (nodes auto register to participate) -- important for spinning up new nodes
- HTML dashboard or (REST?) API that reports the nodes that are participating in the clock synchronization and what the relative time offsets are
- Pretty graphs?
After struggling with NTP on VMs for many months, we switched to chrony (https://chrony.tuxfamily.org). I have found it to be far superior to ntpd in many respects (configuration, control, documentation, and handling cases where the VM clock drifts often and drastically).
Use chrony and don't look back :)
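For the "detect a bad node and react" requirement, one option is to poll `chronyc tracking` on each node and act on the reported offset. The following is a rough sketch under some assumptions: the helper names, the sign convention (fast is positive), and the choice of half the cluster tolerance as the per-node limit are mine, and the regular expressions assume the usual `System time` and `Leap status` lines in the `chronyc tracking` output.

```python
import re
import subprocess

CLUSTER_TOLERANCE_S = 0.2  # 200 ms between any two nodes

def parse_tracking(text):
    """Return (offset_seconds, synchronised) from `chronyc tracking` text.

    The offset is positive when the system clock is fast (my convention).
    """
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)", text)
    if m is None:
        raise ValueError("no 'System time' line found")
    offset = float(m.group(1))
    if m.group(2) == "slow":
        offset = -offset
    leap = re.search(r"Leap status\s*:\s*(.+)", text)
    synchronised = leap is not None and leap.group(1).strip() != "Not synchronised"
    return offset, synchronised

def clock_is_healthy(text, tolerance=CLUSTER_TOLERANCE_S):
    """Healthy means synchronised and within half the cluster tolerance,
    so that any two healthy nodes differ by at most `tolerance`."""
    offset, synchronised = parse_tracking(text)
    return synchronised and abs(offset) <= tolerance / 2

def read_tracking():
    """Run chronyc on the local node and return its raw output."""
    result = subprocess.run(["chronyc", "tracking"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A watchdog could call `clock_is_healthy(read_tracking())` every few seconds and begin a controlled shutdown of the node when it returns False.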
Since the FAQ for NTP specifically states why NTP time sync doesn't work 'right' under virtual machines, it's probably an insurmountable problem.
Most machines have an RTC (real-time clock) in them. On PCs it's how the time is stored so that you have a 'rough' guess as to what the time is if ntp is unavailable. Once the system is loaded, there's a higher-resolution 'tick' clock, and that's what NTP sets.
That tick clock is subject to the drift of the virtual machine since ticks may or may not happen at the correct intervals - any time mechanism you attempt to use is going to be subject to that drift.
It's probably suboptimal design to try to enforce ntp synchronization on virtual machines. If machines A and B have a delta of 200ms, and machines B and C have a delta of 200ms, C could be 400ms away from A. You can't control that.
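The arithmetic behind that A/B/C example, as a small sketch (the function names are made up for illustration): pairwise deltas add up, so a cluster-wide bound has to be stated against a common reference rather than between neighbours.

```python
def max_pairwise_spread(offsets):
    """Largest clock difference between any two nodes, given each node's
    offset (in seconds) from some common reference."""
    return max(offsets) - min(offsets)

def per_node_bound(cluster_tolerance):
    """If every node stays within +/- this bound of the reference, no two
    nodes can differ by more than cluster_tolerance."""
    return cluster_tolerance / 2

# A is 200 ms ahead of B, and C is 200 ms behind B:
a, b, c = 0.2, 0.0, -0.2
spread = max_pairwise_spread([a, b, c])  # about 0.4 s between A and C
```

So for a 200 ms guarantee across the whole cluster, each node would need to stay within roughly 100 ms of whatever the cluster agrees on as the reference.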
You're better off using a centralized messaging system like zeromq to keep everybody in sync with the job queue. It's going to be more overhead, but relying on system tick time is a dodgy affair at best. There are many clustering solutions that track cluster participation using all sorts of reliable mechanisms to ensure that everyone is in sync. Take a look at corosync or spread: they've solved this already for things like two-phase commits.
Incidentally, ntp 'giving up' when drift is too high can be circumvented by instructing it to 'slam' (step) the time to the new value rather than 'slew' it. By default ntp will incrementally update the system time to account for its drift from 'real time'. I forget how to configure this in ntpd, but if you use ntpdate the flag is -b (lowercase; the uppercase -B does the opposite and forces a slew).