Why is the release version of memset slower than the debug version?

Posted 2019-01-19 22:49

Question:

Why is the release version of memset slower than the debug version in Visual Studio 2012? Visual Studio 2010 gives the same result. My computer:

Intel Core i7-3770 3.40 GHz, 8 GB memory, OS: Windows 7 SP1 64-bit

this is my test code:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <boost/progress.hpp>

int main()
{
    const int Size = 1000*1024*1024;
    char* Data = (char*)malloc(Size);

#ifdef _DEBUG
    printf_s("debug\n");
#else
    printf_s("release\n");
#endif

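    // progress_timer prints the elapsed time when it is destroyed at the end of main().
    boost::progress_timer timer;
    memset(Data, 0, Size);  // first touch of the buffer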

    return 0;
}

the output:

release
0.27 s

debug
0.06 s

edited:

If I change the code to this, the debug and release builds give the same result:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <boost/progress.hpp>

int main()
{
    const int Size = 1000*1024*1024;
    char* Data = (char*)malloc(Size);
    memset(Data, 1, Size);  // touch every page first so physical memory is mapped

#ifdef _DEBUG
    printf_s("debug\n");
#else
    printf_s("release\n");
#endif

    {
        // Only the second memset is timed; the pages are already mapped.
        boost::progress_timer timer;
        memset(Data, 0, Size);
    }

    return 0;
}

So Hans Passant is right. Thank you very much.

Answer 1:

This is a standard benchmark mistake: you don't measure the execution time of memset() at all. You actually measure the time the operating system needs to handle the quarter of a million page faults that your code generates (1000 MiB split into 4 KiB pages is 256,000 pages). That time depends heavily on what other processes are running and on how many pages the kernel's zero page thread has already prepared.

On a demand-paged virtual memory operating system like Windows, malloc() doesn't allocate memory at all. It allocates address space, which is just numbers to the processor. Physical memory isn't allocated until the processor first accesses that address space. The access raises a soft page fault because the address isn't mapped to RAM yet, and the kernel must then provide physical RAM before the processor can continue.

If you want an estimate of how long memset() really takes, you have to call it twice. The first call ensures that the RAM is mapped. Time the second call to measure how long the memory writes take. For memory ranges as large as yours that time is essentially fixed: the memory cache and write-back buffers are ineffective, so speed is determined entirely by the bandwidth of the memory bus. Your debug result suggests DDR3 clocked at 266 MHz, pretty common.

This also removes the bias you get from the debug allocator in the debug build of the CRT. It fills allocated memory with a bit pattern that's likely to induce a crash when you use uninitialized memory. That fill touches every page inside malloc(), so in the debug build the page faults happened before the timer started. Since the cost of malloc() wasn't included in the measurement, the page fault overhead was hidden.