Comparing performance of two copying techniques?

Posted 2019-06-13 17:56

For copying a huge double array to another array I have following two options:

Option 1

copy(arr1, arr1+N, arr2);

Option 2

#pragma omp parallel for
for(int i = 0; i < N; i++)
    arr2[i] = arr1[i];

I want to know, for a large value of N, which of the two options will be better (take less time), and when?

System configuration:
Memory: 15.6 GiB
Processor: Intel® Core™ i5-4590 CPU @ 3.30GHz × 4
OS-Type: 64-bit
compiler: gcc (Ubuntu 4.9.2-0ubuntu1~12.04) 4.9.2

2 Answers

小情绪 Triste *
#2 · 2019-06-13 18:08

Practically, if performance matters, measure it.

std::copy and memcpy are usually highly optimized, using sophisticated performance tricks. Your compiler may or may not be clever enough / have the right configuration options to achieve that performance from a raw loop.

That said, theoretically, parallelizing the copy can provide a benefit. On modern systems you must use multiple threads to fully utilize both your memory and cache bandwidth. Take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows compare parallel versus single-threaded main memory bandwidth. On a desktop system like yours, the gap is not very large. On a high-performance-oriented system, especially with multiple sockets, more threads are very important to exploit the available bandwidth.

For an optimal solution, you have to consider things like not writing the same cache-line from multiple threads. Also if your compiler doesn't produce perfect code from the raw loop, you may have to actually run std::copy on multiple threads/chunks. In my tests, the raw loop performed much worse, because it doesn't use AVX. Only the Intel compiler managed to actually replace parts in the OpenMP loop with an avx_rep_memcpy - interestingly it did not perform this optimization with a non-OpenMP loop. The optimal number of threads for memory bandwidth is also usually not the number of cores, but less.

The general recommendation is: start with a simple implementation, in this case the idiomatic std::copy, and later profile your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out this is a bottleneck for your application, and your hardware resources are not utilized well, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.

爱情/是我丢掉的垃圾
#3 · 2019-06-13 18:13

Option 1 is better.

RAM is a shared resource; you cannot simply parallelize access to it. When one core is using the memory bus, the others must wait.

Moreover, RAM is usually slower than the CPU -- RAM frequency is lower than CPU frequency -- so in the case above even a single core spends cycles just waiting on the RAM.

You also might consider memcpy() for copying; it might be faster than std::copy(). It generally depends on the implementation.

Last but not least, always measure. To start, just record omp_get_wtime() before and after the piece of code you are measuring and take the difference.
