Use of OpenMP chunk to break cache

Posted 2019-05-26 11:47

Question:

I've been trying to increase the performance of my OpenMP solution, which often has to deal with nested loops over arrays. Although I've managed to bring the runtime down from the 59 seconds of the serial implementation to 37 seconds (on an ageing dual-core Intel T6600), I'm worried that cache synchronisation gets a lot of CPU attention (when the CPU should be solving my problem!). I've been fighting to set up the profiler, so I haven't verified that claim, but my question stands regardless. According to this lecture on load balancing:

Instead of doing work, the CPUs are busy fighting over the only used cache line in the program. You can fix this with a very strange technique: move the CPUs' data apart in memory further than one cache line. For example, here we move the integers accessed by each thread 20 units apart.

The lecture then goes on to provide the relevant source code (meant to run on a quad-core, hence the % 4):

#pragma omp parallel for schedule(static,1)
for (unsigned int i = 0; i < n; i++) {
    arr[(i % 4) * 20]++;
}

That said, I have an intuition about what the "chunk" is, but the above implementation seems to ignore it completely, leading me to believe my intuition is wrong.

My question is this: does setting a reasonably large chunk value push each thread's data further apart along the cache lines? I.e., wouldn't the code above be equivalent to

#pragma omp parallel for schedule(static, 20)
for (unsigned int i = 0; i < n; i++) {
    arr[i]++;
}

Answer 1:

The two code snippets you've given are not equivalent, as the first one keeps reiterating over the same four elements for n larger than 4.

The proper way to process such arrays is to ensure that sizeof(arr[0]) * n / #cores is a multiple of the cache line size. Modern x86 CPUs have a cache line size of 64 bytes. If arr is an array of integers or single-precision floating-point numbers, then sizeof(arr[0]) == 4 and a single cache line holds 16 elements. For data types of twice that size a single cache line holds 8 elements. The best loop schedule chunk size is then a multiple of 16 in the former case or 8 in the latter.
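To make that arithmetic concrete, here is a minimal sketch (the function name, the array argument and the factor of 4 are made up for illustration) in which the chunk size is a whole multiple of the 16 int elements that fit in one 64-byte cache line, so that, assuming the array itself starts on a cache-line boundary, chunks handed to different threads never share a cache line:

// assumed 64-byte cache lines: 64 / sizeof(int) == 16 elements per line
#define LINE_INTS (64 / sizeof(int))

void bump_all(int *a, unsigned int n)
{
   // each chunk spans exactly 4 cache lines (64 ints), so neighbouring
   // chunks handed to different threads start on separate cache lines
   #pragma omp parallel for schedule(static, 4 * LINE_INTS)
   for (unsigned int i = 0; i < n; i++)
      a[i]++;
}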

When dealing with statically scheduled loops, one tries to maximise the chunk size in order to reduce the number of separate chunks each thread has to run. For example, with 4 threads, n equal to 64 and the chunk size set to 8, the following schedule is used:

thread    iterations (from-to)
------    --------------------
  0          0- 7, 32-39
  1          8-15, 40-47
  2         16-23, 48-55
  3         24-31, 56-63

This is far from optimal since each thread has to process two separate chunks. A much better solution is to set the chunk size to 16 (which is a multiple of 8):

thread    iterations (from-to)
------    --------------------
  0             0-15
  1            16-31
  2            32-47
  3            48-63

Note that when no chunk size is specified, a statically scheduled loop divides the iterations into contiguous chunks of approximately #iterations / #threads, one chunk per thread.
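For reference, here is what the two schedules from the tables look like in code; the 64-element array and the wrapper function are invented for illustration. With 64 iterations and 4 threads, the second pragma partitions the work the same way as the default schedule(static):

void demo(int arr[64])
{
   // chunk size 8: each of the 4 threads processes two non-contiguous chunks
   #pragma omp parallel for schedule(static, 8)
   for (int i = 0; i < 64; i++)
      arr[i]++;

   // chunk size 16: each thread processes one contiguous 16-element chunk
   #pragma omp parallel for schedule(static, 16)
   for (int i = 0; i < 64; i++)
      arr[i]++;
}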

Sometimes one has to process in parallel data that cannot be spread over non-overlapping cache lines. For example, arr[] might simply be an array of 4 elements that all fit into a single cache line. In this case one should insert padding between the array elements in order to make sure that the data processed by different threads lies in different cache lines. For example:

int arr[4];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i]++;

int arr[4] results in the following memory layout:

|<-------- a single cache line ---------->|
| arr[0] | arr[1] | arr[2] | arr[3] | ... |

If core 0 updates arr[0] and core 1 updates arr[1], the cache line constantly bounces between the two cores (false sharing) and performance suffers. Therefore one has to insert padding between arr[0] and arr[1] of size CLS - sizeof(arr[0]) bytes, or CLS/sizeof(arr[0]) - 1 array elements, where CLS is the size of the cache line in bytes. With CLS == 64 and sizeof(arr[0]) == 4 this makes 15 elements of padding. The resulting layout would be:

|<----- one cache line ------>|<--- another cache line ---->|<-- yet another ...
| arr[0] | 15 unused elements | arr[1] | 15 unused elements | arr[2] | ...

The code snippet should then be modified as follows:

// cache line size in number of int elements
#define CLS    (64/sizeof(int))

int arr[4*CLS];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i*CLS]++;

Another option that would somewhat simplify the code is to wrap each data element in a structure and put the padding inside the structure:

// cache line size in number of bytes
#define CLS    (64)

typedef struct _item
{
   int data;
   int padding[CLS/sizeof(int)-1];
} item;

item arr[4];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i].data++;

No matter which method you use, bear in mind that such code becomes non-portable as various architectures have differing cache line sizes.
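If you are on Linux with glibc, one way to avoid hard-coding the 64 is to query the L1 data cache line size at run time. Note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc-specific extension, so this sketch is itself not portable:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
   // glibc extension: L1 data cache line size in bytes (<= 0 if unknown)
   long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
   if (cls <= 0)
      cls = 64;   // fall back to the common x86 value
   printf("cache line size: %ld bytes\n", cls);
   return 0;
}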