Clock synchronization quality on Windows Azure?

Posted 2019-02-02 21:46

Question:

I am looking for quantitative estimates of clock offsets between VMs on Windows Azure, assuming that all VMs are hosted in the same datacenter. My guess is that the average clock offset between one VM and another is below 10 seconds, but I am not even sure that this is a guaranteed property of the Azure cloud.

Does anybody have quantitative measurements on this?

Answer 1:

I finally settled on running some experiments of my own.

A few facts concerning the experiment protocol:

  • Instead of measuring the offset to a reference clock, I simply checked clock differences between Azure VMs and the Azure Storage.
  • The clock time of the Azure Storage was retrieved using the HTTP hack pasted below.
  • Measurements were done within the North Europe datacenter of Azure with 250 small VMs.
  • Latency between the storage and the VMs, measured with Stopwatch, was always below 1ms for minimal unauthenticated requests (the HTTP requests came back with 400 errors, but the Date: header was still present in the response).
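The offset computation implied by this protocol can be sketched as follows. This is a minimal sketch, not the code used in the experiment: the class name ClockOffsetEstimator and its parameters are hypothetical, and it assumes the server stamped its reply halfway through the round trip measured with Stopwatch.

```csharp
using System;

// Hypothetical helper: estimates how far a remote clock is ahead of the local one
// from a single request/response exchange.
public static class ClockOffsetEstimator
{
    // localSendUtc: local UTC time read just before the request was sent.
    // roundTrip:    elapsed time measured with Stopwatch around the request.
    // remoteUtc:    time reported by the remote side (e.g. the HTTP Date header).
    // Returns remote minus local; positive means the remote clock is ahead.
    public static TimeSpan Estimate(DateTime localSendUtc, TimeSpan roundTrip, DateTime remoteUtc)
    {
        // Assumption: the reply was stamped at the midpoint of the round trip.
        DateTime localMidpointUtc = localSendUtc + TimeSpan.FromTicks(roundTrip.Ticks / 2);
        return remoteUtc - localMidpointUtc;
    }
}
```

Note that the HTTP Date header only has one-second resolution, so a single sample is uncertain by up to a second regardless of the sub-millisecond latency; averaging several samples would be one way to narrow that down.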

Results:

  • About 50% of the VMs have a clock offset to the storage greater than 1s.
  • About 5% of the VMs have a clock offset to the storage greater than 2s.
  • Fewer than 1% of the observations show clock offsets close to 3s.
  • A handful of outliers are close to 4s.
  • The clock offset between a single VM and the storage typically varies by +1/-1s from one request to the next.

So technically, we are not too far from the 2s tolerance target, although even for intra-datacenter sync you don't have to push the experiment far to observe offsets close to 4s. If we assume a normal (i.e. Gaussian) distribution for the clock offsets, then I would say that relying on any clock threshold lower than 6s is bound to lead to scheduling issues.

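using System;
using System.Globalization;
using System.IO;
using System.Net.Sockets;

/// <summary>
/// Substitute for proper NTP (Network Time Protocol) 
/// when UDP is not available, as on Windows Azure.
/// </summary>
public class HttpTimeChecker
{
    public static DateTime GetUtcNetworkTime(string server)
    {
        // HACK: we can't use WebClient here, because we get a faulty HTTP response.
        // We don't care about the HTTP error; the only thing that matters is the presence
        // of the 'Date:' HTTP header.
        string response;
        using (var tc = new TcpClient())
        {
            tc.Connect(server, 80);

            using (var ns = tc.GetStream())
            {
                var sw = new StreamWriter(ns);
                var sr = new StreamReader(ns);

                // HTTP header lines are terminated by CRLF; an empty line ends the request.
                string req = "";
                req += "GET / HTTP/1.0\r\n";
                req += "Host: " + server + "\r\n";
                req += "\r\n";

                sw.Write(req);
                sw.Flush();

                // With HTTP/1.0 the server closes the connection after responding.
                response = sr.ReadToEnd();
            }
        }

        foreach (var line in response.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Header names are case-insensitive; compare ordinally to avoid culture surprises.
            if (line.StartsWith("Date:", StringComparison.OrdinalIgnoreCase))
            {
                // Parse with the invariant culture so the result does not depend on the
                // VM's regional settings, and return the value as UTC.
                return DateTime.Parse(
                    line.Substring(5).Trim(),
                    CultureInfo.InvariantCulture,
                    DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal);
            }
        }

        throw new ArgumentException("No date to be retrieved among HTTP headers.", "server");
    }
}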


Answer 2:

I've been in conversation with someone from the Azure product team regarding clock synchronisation recently, more out of interest than anything else. The most recent reply I've received is:

The VMs and services take their time directly from the underlying Hyper-V platform upon boot and from that point forward the clock is maintained by the service. In order to have true time sync across a distributed system you will need to do this at the application layer and/or with a service referencing a singular time server.



Answer 3:

Based on my experience, I would not rely on the system clock of the Azure VMs for anything critical. I have occasionally seen differences of up to several minutes, which flies in the face of what you'd expect.



Answer 4:

This is the classic problem of both distributed systems and virtual machines - clock skew.

One possible solution would be to use the Azure scheduler to ping an endpoint on each of your VMs that would reset its clock, or at least tell you what the difference is. That way, your skew would not grow, and you may even be able to calculate an offset that accounts for the communication delay. This would get you to within milliseconds rather than seconds.

Of course, you could also go the other way and have a service on the VM that periodically manages the clock by querying some time server. I'm not sure if the hypervisor will let you mess with its clock, but all you really need is an offset for your apps to consume.

Overall... never trust the clock on a VM, and certainly not across a distributed system. Note that this clock issue is the subject of active research at many universities, e.g. https://scholar.google.com/scholar?hl=en&q=distributed+system+clock&btnG=&as_sdt=1%2C48&as_sdtp=
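The "offset for your apps to consume" idea might look like the sketch below. It is only an illustration under stated assumptions: the OffsetClock class and its members are hypothetical names, the four timestamps follow the usual NTP-style exchange (which assumes roughly symmetric network delay), and the system clock itself is never modified.

```csharp
using System;
using System.Threading;

// Hypothetical application-level clock: leaves the system clock alone and
// applies a stored correction on every read.
public sealed class OffsetClock
{
    private long _offsetTicks; // correction added to the local clock, in ticks

    // Local UTC time corrected by the last known offset.
    public DateTime UtcNow
    {
        get { return DateTime.UtcNow + Offset; }
    }

    public TimeSpan Offset
    {
        get { return TimeSpan.FromTicks(Interlocked.Read(ref _offsetTicks)); }
    }

    // NTP-style offset estimate from one exchange:
    //   t0 = local time when the request left,  t1 = server time when it arrived,
    //   t2 = server time when the reply left,   t3 = local time when the reply arrived.
    // offset = ((t1 - t0) + (t2 - t3)) / 2, assuming symmetric network delay.
    public static TimeSpan ComputeOffset(DateTime t0, DateTime t1, DateTime t2, DateTime t3)
    {
        return TimeSpan.FromTicks(((t1 - t0).Ticks + (t2 - t3).Ticks) / 2);
    }

    // Stores a new correction; safe to call from a background refresh task.
    public void Update(TimeSpan offset)
    {
        Interlocked.Exchange(ref _offsetTicks, offset.Ticks);
    }
}
```

A background task would periodically call ComputeOffset against whatever time source is chosen and pass the result to Update, while application code reads UtcNow from the shared OffsetClock instance instead of DateTime.UtcNow.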



Answer 5:

I've tried to search for an answer to this specific question - but haven't succeeded!

Some references I have found about the "Windows Time Service" (W32Time) state that the service is designed for a tolerance of 2 seconds, e.g.

  • http://www.windowsitpro.com/article/time-synchronization/windows-time-synchronization-service
  • http://support.microsoft.com/kb/939322

In practice, within the Azure network, I would expect the synchronisation achieved to be much better than this, but my search turned up no documented guarantees.



Answer 6:

You can never trust clock synchronization when building a distributed system unless special hardware measures are used, as for example in Google Spanner. Even there, a special algorithm is used to resolve possible clock skew conflicts. However, there are many algorithms that solve this problem in distributed systems without synchronized clocks: logical clocks, vector clocks and Lamport timestamps, to name a few. See the classic book "Distributed Systems: Principles and Paradigms" by Andrew Tanenbaum and Maarten van Steen.
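As an illustration of the simplest of these, here is a minimal sketch of a Lamport logical clock. The class and method names are hypothetical; the rule it follows is the standard one: increment the counter on every local event or send, and on receive set it to the maximum of the local and received values plus one.

```csharp
using System;
using System.Threading;

// Minimal Lamport logical clock (illustrative sketch, hypothetical names).
// Orders events without relying on wall-clock time.
public sealed class LamportClock
{
    private long _time;

    public long Current
    {
        get { return Interlocked.Read(ref _time); }
    }

    // Call for every local event, including sending a message;
    // attach the returned value to the outgoing message.
    public long Tick()
    {
        return Interlocked.Increment(ref _time);
    }

    // Call when a message arrives carrying the sender's timestamp.
    public long Receive(long remoteTimestamp)
    {
        while (true)
        {
            long current = Interlocked.Read(ref _time);
            long next = Math.Max(current, remoteTimestamp) + 1;
            // Retry if another thread advanced the clock in the meantime.
            if (Interlocked.CompareExchange(ref _time, next, current) == current)
            {
                return next;
            }
        }
    }
}
```

With two VMs each holding a LamportClock, any event that causally follows a received message is guaranteed a larger timestamp than the send that caused it, however far apart the two system clocks have drifted.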