memset in parallel with threads bound to each physical core

Question:

I have been testing the code from "In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?" and I'm observing something unexpected.

My system is a single-socket Xeon E5-1620, an Ivy Bridge processor with four physical cores and eight hyper-threads. I'm using Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and EGLIBC 2.19. I compile with gcc -fopenmp -O3 mem.c.

When I run the code in the link it defaults to eight threads and gives

Touch:   11830.448 MB/s
Rewrite: 18133.428 MB/s

However, when I bind the threads and set the number of threads to the number of physical cores like this

export OMP_NUM_THREADS=4 
export OMP_PROC_BIND=true

I get

Touch:   22167.854 MB/s
Rewrite: 18291.134 MB/s

The touch rate has doubled! Repeated runs after binding always show touch faster than rewrite. I don't understand this. Why is touch faster than rewrite once the threads are bound and their number matches the physical cores? Why has the touch rate doubled?
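
A minimal sketch to verify the thread placement, assuming Linux's glibc, which provides sched_getcpu() (compile with gcc -fopenmp):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // Each OpenMP thread reports the logical CPU it is currently running on.
        printf("thread %d on CPU %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

With OMP_NUM_THREADS=4 and OMP_PROC_BIND=true this should print four distinct CPU numbers that stay stable across runs.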

Here is the code I used, taken from Hristo Iliev's answer (unchanged apart from adding the #include <stdlib.h> that malloc and free need).

#include <stdio.h>
#include <stdlib.h> // for malloc/free
#include <string.h>
#include <omp.h>

void zero(char *buf, size_t size)
{
    size_t my_start, my_size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        // Split the buffer into near-equal contiguous chunks, one per thread
        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }
    else
    {
        my_start = 0;
        my_size = size;
    }

    memset(buf + my_start, 0, my_size);
}

int main (void)
{
    char *buf;
    size_t size = 1L << 31; // 2 GiB
    double tmr;

    buf = malloc(size);

    // Touch: the first write faults each page in, so page-fault cost is included
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Touch:   %.3f MB/s\n", size/(1.e+6*tmr));

    // Rewrite: the pages are already mapped, so this measures pure write bandwidth
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Rewrite: %.3f MB/s\n", size/(1.e+6*tmr));

    free(buf);

    return 0;
}
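
For completeness, this is how the two configurations above can be built and run (assuming bash):

gcc -fopenmp -O3 mem.c -o mem
./mem                                        # defaults to eight threads, unbound
OMP_NUM_THREADS=4 OMP_PROC_BIND=true ./mem   # four threads, bound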

Edit: Without thread binding, but using four threads, here are the results from eight runs.

Touch:   14723.115 MB/s, Rewrite: 16382.292 MB/s
Touch:   14433.322 MB/s, Rewrite: 16475.091 MB/s 
Touch:   14354.741 MB/s, Rewrite: 16451.255 MB/s  
Touch:   21681.973 MB/s, Rewrite: 18212.101 MB/s 
Touch:   21004.233 MB/s, Rewrite: 17819.072 MB/s 
Touch:   20889.179 MB/s, Rewrite: 18111.317 MB/s 
Touch:   14528.656 MB/s, Rewrite: 16495.861 MB/s
Touch:   20958.696 MB/s, Rewrite: 18153.072 MB/s

Edit:

I tested this code on two other systems and I can't reproduce the problem on them.

i5-4250U (Haswell) - 2 physical cores, 4 hyper-threads

4 threads unbound
    Touch:   5959.721 MB/s, Rewrite: 9524.160 MB/s
2 threads bound to each physical core
    Touch:   7263.175 MB/s, Rewrite: 9246.911 MB/s

Four-socket E7-4850 - 10 physical cores, 20 hyper-threads per socket

80 threads unbound
    Touch:   10177.932 MB/s, Rewrite: 25883.520 MB/s
40 threads bound
    Touch:   10254.678 MB/s, Rewrite: 30665.935 MB/s

This shows that binding the threads to the physical cores improves both touch and rewrite, but touch is slower than rewrite on these two systems.

I also tested three different variations of memset: my_memset, my_memset_stream, and A_memset. The functions my_memset and my_memset_stream are defined below. The function A_memset comes from Agner Fog's asmlib.

my_memset results:

Touch:   22463.186 MB/s
Rewrite: 18797.297 MB/s

I think this shows that the problem is not in EGLIBC's memset function.

A_memset results:

Touch:   18235.732 MB/s
Rewrite: 44848.717 MB/s

my_memset_stream:

Touch:   18678.841 MB/s
Rewrite: 44627.270 MB/s

Looking at the source code of the asmlib, I saw that it uses non-temporal stores when writing large chunks of memory. That's why my_memset_stream gets about the same bandwidth as Agner Fog's asmlib. The maximum memory throughput of this system is 51.2 GB/s, so A_memset and my_memset_stream reach about 85% of that maximum.
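
As a sanity check on that figure (assuming the E5-1620's quad-channel DDR3-1600 memory): 4 channels × 1600 MT/s × 8 bytes = 51.2 GB/s peak, and 44.8 / 51.2 ≈ 0.87, consistent with the "about 85%" above.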

void my_memset(int *s, int c, size_t n) {
    size_t i;
    // Plain scalar store loop; n is the size in bytes, hence n/4 ints.
    // Note that at -O3 GCC may recognize this idiom and turn it into a memset call.
    for(i=0; i<n/4; i++) {
        s[i] = c;
    }
}

#include <emmintrin.h> // SSE2 intrinsics: _mm_set1_epi32, _mm_stream_si128

void my_memset_stream(int *s, int c, size_t n) {
    size_t i;
    __m128i v = _mm_set1_epi32(c);

    // Non-temporal (streaming) stores bypass the cache; s must be
    // 16-byte aligned, which glibc's malloc provides on x86-64.
    for(i=0; i<n/4; i+=4) {
        _mm_stream_si128((__m128i*)&s[i], v);
    }
}
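
One caveat with the streaming version (a sketch, not part of the measured code): non-temporal stores are weakly ordered, so strictly speaking an sfence should drain the write-combining buffers before the timer is stopped, e.g.

void my_memset_stream_fenced(int *s, int c, size_t n)
{
    my_memset_stream(s, c, n);
    _mm_sfence(); // make the non-temporal stores globally visible before timing
}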

Answer 1:

It would appear from your numbers that your 4 bound threads are running on 2 physical cores instead of the expected 4 physical cores. Can you confirm this? It would explain the doubling of the Touch times. I'm not sure how to force a thread to a physical core when using hyperthreading on your system. {I tried adding this as a question, but have insufficient "reputation"}