Why is Skylake so much better than Broadwell-E for

2019-01-03 11:07发布

We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.

Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.

Any ideas why this might be happening? The code that follows is compiled in Release in VS2015, and reports average time to complete each memcpy at:

64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E

32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.

We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?

We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.

#include <memory>
#include <Windows.h>
#include <iostream>

//Prevent the memcpy from being optimized out of the for loop
_declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
    memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}

int main()
{
    const int SIZE_OF_BLOCKS = 25000000;
    const int NUMBER_ITERATIONS = 100;
    void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
    void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency);
    while (true)
    {
        LONGLONG total = 0;
        LONGLONG max = 0;
        LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
        for (int i = 0; i < NUMBER_ITERATIONS; ++i)
        {
            QueryPerformanceCounter(&StartingTime);
            MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
            QueryPerformanceCounter(&EndingTime);
            ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
            ElapsedMicroseconds.QuadPart *= 1000000;
            ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
            total += ElapsedMicroseconds.QuadPart;
            max = max(ElapsedMicroseconds.QuadPart, max);
        }
        std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
        std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
    }
    getchar();
}

2条回答
The star\"
2楼-- · 2019-01-03 11:23

I finally got VTune (evalutation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.

It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.

I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.

I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.

查看更多
一纸荒年 Trace。
3楼-- · 2019-01-03 11:41

Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).

Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).

SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)


A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.

For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)

Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).

查看更多
登录 后发表回答