This is a follow-up to this question where I posted this program:
#include <algorithm>
#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <vector>
#include <chrono>
class Stopwatch
{
public:
    typedef std::chrono::high_resolution_clock Clock;

    //! Constructor starts the stopwatch
    Stopwatch() : mStart(Clock::now())
    {
    }

    //! Returns elapsed number of seconds in decimal form.
    double elapsed()
    {
        return 1.0 * (Clock::now() - mStart).count() / Clock::period::den;
    }

    Clock::time_point mStart;
};

struct test_cast
{
    int operator()(const char * data) const
    {
        return *((int*)data);
    }
};

struct test_memcpy
{
    int operator()(const char * data) const
    {
        int result;
        memcpy(&result, data, sizeof(result));
        return result;
    }
};

struct test_memmove
{
    int operator()(const char * data) const
    {
        int result;
        memmove(&result, data, sizeof(result));
        return result;
    }
};

struct test_std_copy
{
    int operator()(const char * data) const
    {
        int result;
        std::copy(data, data + sizeof(int), reinterpret_cast<char *>(&result));
        return result;
    }
};

enum
{
    iterations = 2000,
    container_size = 2000
};
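//! Returns a list of integers in binary form.
std::vector<char> get_binary_data()
{
    std::vector<char> bytes(sizeof(int) * container_size);
    for (std::vector<char>::size_type i = 0; i != bytes.size(); i += sizeof(int))
    {
        // Copy exactly sizeof(int) bytes; copying sizeof(i) bytes of the
        // size_type index could write past the end of the buffer.
        const int value = static_cast<int>(i);
        memcpy(&bytes[i], &value, sizeof(value));
    }
    return bytes;
}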
template<typename Function>
unsigned benchmark(const Function & function, unsigned & counter)
{
    std::vector<char> binary_data = get_binary_data();
    Stopwatch sw;
    for (unsigned iter = 0; iter != iterations; ++iter)
    {
        for (unsigned i = 0; i != binary_data.size(); i += 4)
        {
            const char * c = reinterpret_cast<const char*>(&binary_data[i]);
            counter += function(c);
        }
    }
    return unsigned(0.5 + 1000.0 * sw.elapsed());
}

int main()
{
    srand(time(0));
    unsigned counter = 0;
    std::cout << "cast: " << benchmark(test_cast(), counter) << " ms" << std::endl;
    std::cout << "memcpy: " << benchmark(test_memcpy(), counter) << " ms" << std::endl;
    std::cout << "memmove: " << benchmark(test_memmove(), counter) << " ms" << std::endl;
    std::cout << "std::copy: " << benchmark(test_std_copy(), counter) << " ms" << std::endl;
    std::cout << "(counter: " << counter << ")" << std::endl << std::endl;
}
I noticed that for some reason std::copy performs much worse than memcpy. The output looks like this on my Mac using gcc 4.7.
g++ -o test -std=c++0x -O0 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast: 41 ms
memcpy: 46 ms
memmove: 53 ms
std::copy: 211 ms
(counter: 3838457856)
g++ -o test -std=c++0x -O1 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast: 8 ms
memcpy: 7 ms
memmove: 8 ms
std::copy: 19 ms
(counter: 3838457856)
g++ -o test -std=c++0x -O2 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast: 3 ms
memcpy: 2 ms
memmove: 3 ms
std::copy: 27 ms
(counter: 3838457856)
g++ -o test -std=c++0x -O3 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast: 2 ms
memcpy: 2 ms
memmove: 3 ms
std::copy: 16 ms
(counter: 3838457856)
As you can see, even with -O3 it is about 8 times (!) slower than memcpy (16 ms versus 2 ms).
The results are similar on Linux.
Does anyone know why?
Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.
std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the benchmark calls memcpy.
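One way to check this is to isolate the two copies in free functions and read the generated assembly. The sketch below is only an illustration, not part of the benchmark above; the names load_memcpy, load_std_copy and check_loads_agree are made up for this example. Compiling it with g++ -O2 -S lets you compare whether each function becomes a single load or a call into the library. What you see will depend on the compiler version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical helper: read an int from a byte buffer with memcpy.
// The size is a compile-time constant, so a compiler that knows the
// semantics of memcpy may replace the call with a single load.
int load_memcpy(const char* data)
{
    int result;
    std::memcpy(&result, data, sizeof(result));
    return result;
}

// Hypothetical helper: the same read expressed with std::copy.
// Whether this is reduced to a single load depends on how far the
// optimizer inlines the library implementation.
int load_std_copy(const char* data)
{
    int result;
    std::copy(data, data + sizeof(result), reinterpret_cast<char*>(&result));
    return result;
}

// Sanity check: both helpers must read the same value from the buffer.
void check_loads_agree(const char* data)
{
    assert(load_memcpy(data) == load_std_copy(data));
}
```

Both functions are meant to be semantically identical, so any difference in the assembly comes from the optimizer rather than from the code.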
I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.

By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).

I agree with @rici's comment about developing a more meaningful benchmark, so I rewrote your test to benchmark copying of two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

I went a little overboard with C++11 just for fun.
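The rewritten listing is not reproduced here. As a rough illustration only, and not the answerer's actual code, a minimal sketch of such a vector-copy comparison might look like the following. All helper names (copy_memcpy, copy_memmove, copy_std_copy, copy_assign, time_ms, run_benchmark) are made up for this example, and the destination is allocated once up front so that allocation is not part of the measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <numeric>
#include <vector>

// Each hypothetical helper copies src into dst; dst must already have
// the same size as src (copy_assign resizes dst itself if needed).
void copy_memcpy(const std::vector<int>& src, std::vector<int>& dst)
{
    if (!src.empty())
        std::memcpy(dst.data(), src.data(), sizeof(int) * src.size());
}

void copy_memmove(const std::vector<int>& src, std::vector<int>& dst)
{
    if (!src.empty())
        std::memmove(dst.data(), src.data(), sizeof(int) * src.size());
}

void copy_std_copy(const std::vector<int>& src, std::vector<int>& dst)
{
    std::copy(src.begin(), src.end(), dst.begin());
}

void copy_assign(const std::vector<int>& src, std::vector<int>& dst)
{
    dst = src;
}

// Runs `copy` `repeats` times and returns the elapsed time in milliseconds.
template <typename Copy>
double time_ms(Copy copy, const std::vector<int>& src, int repeats)
{
    typedef std::chrono::steady_clock Clock;
    std::vector<int> dst(src.size());
    volatile long long sink = 0; // keeps the copies observable to the optimizer
    const Clock::time_point start = Clock::now();
    for (int i = 0; i != repeats; ++i)
    {
        copy(src, dst);
        sink = sink + (dst.empty() ? 0 : dst.back());
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    assert(dst == src); // every method must produce an exact copy
    return elapsed.count();
}

// Prints one timing line per copy method for a vector of `count` ints.
void run_benchmark(std::size_t count, int repeats)
{
    std::vector<int> src(count);
    std::iota(src.begin(), src.end(), 0);
    std::cout << "memcpy:     " << time_ms(copy_memcpy, src, repeats) << " ms\n";
    std::cout << "memmove:    " << time_ms(copy_memmove, src, repeats) << " ms\n";
    std::cout << "std::copy:  " << time_ms(copy_std_copy, src, repeats) << " ms\n";
    std::cout << "assignment: " << time_ms(copy_assign, src, repeats) << " ms\n";
}
```

Calling something like run_benchmark(1000000, 100) from main prints the four timings. No sample output is given for this sketch because the numbers depend entirely on the machine, compiler and optimization level.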
Here are the results I get on my 64 bit Ubuntu box with g++ 4.6.3:
The results are all quite comparable! I get comparable times in all test cases when I change the integer type, e.g. to long long, in the vector as well.

Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!
memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be as slow as memmove, because there is no guarantee that the memory regions will not overlap. Because it works on iterators, std::copy can also copy non-contiguous ranges very easily (think of sparsely allocated structures like linked lists, or even custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.

According to the assembler output of G++ 4.8.1,

test_memcpy:

test_std_copy:

As you can see, std::copy successfully recognized that it can copy data with memcpy, but for some reason further inlining did not happen, so that is the reason for the performance difference.

By the way, Clang 3.4 produces identical code for both cases:
Those are not the results I get:
On a Linux box I get different results: