OpenMP: don't use hyperthreading cores (half `

Published 2019-05-07 11:14

Question:

In "Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU", I determined that the run time of my programme varies between 11s and 13s (almost always above 12s, and sometimes as slow as 13.4s) at around 500% CPU when using the default #pragma omp parallel for, and that the OpenMP speed-up is only 2.5x at 5x CPU with g++-4.7 -O3 -fopenmp on a 4-core, 8-thread Xeon.

I tried using schedule(static) num_threads(4), and noticed that my programme always completes in 11.5s to 11.7s (always below 12s) at about 320% CPU, i.e., it runs more consistently and uses fewer resources (even if the best run is half a second slower than the rare outlier with hyperthreading).

Is there a simple OpenMP way to detect hyperthreading and reduce num_threads() to the actual number of CPU cores?

(There is a similar question, "Poor performance due to hyper-threading with OpenMP: how to bind threads to cores", but in my testing I found that merely reducing from 8 to 4 threads somehow already does that job with g++-4.7 on Debian 7 wheezy and a Xeon E3-1240v3, so this question is only about reducing num_threads() to the number of cores.)

Answer 1:

If you are running under Linux [also assuming an x86 arch], you can look at /proc/cpuinfo. There are two fields, cpu cores and siblings. The first is the number of [real] cores and the second is the number of hyperthreads [logical processors], both counted per physical package (e.g. on my system they are 4 and 8 respectively for my four-core hyperthreaded machine).
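
As a rough sketch of the /proc/cpuinfo approach (Linux only): the helper below scans a cpuinfo-style stream for the first "cpu cores" and "siblings" entries. The name read_cpuinfo_counts is made up for illustration, and the code assumes every package reports the same values, so it only looks at the first occurrence of each field.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan a /proc/cpuinfo-style stream for the first
   "cpu cores" and "siblings" fields. Returns 0 on success, -1 if either
   field is missing. Both values are per physical package (socket). */
static int read_cpuinfo_counts(FILE *f, int *cores, int *siblings) {
  char line[512];
  int have_cores = 0, have_siblings = 0;
  while (!(have_cores && have_siblings) && fgets(line, sizeof line, f)) {
    const char *colon = strchr(line, ':');
    if (!colon) continue;                 /* not a "key : value" line */
    if (!have_cores && strncmp(line, "cpu cores", 9) == 0) {
      *cores = atoi(colon + 1);           /* atoi skips the leading blank */
      have_cores = 1;
    } else if (!have_siblings && strncmp(line, "siblings", 8) == 0) {
      *siblings = atoi(colon + 1);
      have_siblings = 1;
    }
  }
  return (have_cores && have_siblings) ? 0 : -1;
}
```

Called on a stream opened with fopen("/proc/cpuinfo", "r"), a siblings value larger than cpu cores suggests that SMT is enabled, and cpu cores (multiplied by the number of sockets on a multi-socket machine) is a candidate value for num_threads().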

Because Linux can detect this [and from the link in Zulan's comment], the information is also available from the x86 cpuid instruction.

Either way, there is also an environment variable for this, OMP_NUM_THREADS, which may be easier to use in conjunction with a launcher/wrapper script.
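
If the program also sets the thread count itself, it probably should not silently override what a wrapper script put into OMP_NUM_THREADS. A minimal sketch of that precedence rule follows; pick_num_threads and its parameters are hypothetical names, and the physical core count is assumed to come from some detection step elsewhere.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: decide how many threads to request.
   env_value is the raw OMP_NUM_THREADS string (NULL if unset);
   physical_cores is a count obtained elsewhere. */
static int pick_num_threads(const char *env_value, int physical_cores) {
  if (env_value && *env_value) {
    char *end;
    long n = strtol(env_value, &end, 10);
    /* Accept a plain positive integer, or the first entry of a
       comma-separated list such as "4,2" (one value per nesting level). */
    if (end != env_value && n > 0 && n <= INT_MAX &&
        (*end == '\0' || *end == ','))
      return (int)n;
  }
  /* Unset or unparsable: fall back to the physical core count. */
  return physical_cores > 0 ? physical_cores : 1;
}
```

A caller could pass the result to omp_set_num_threads(), for example omp_set_num_threads(pick_num_threads(getenv("OMP_NUM_THREADS"), cores)), so that an explicit setting from the launcher still wins.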

One thing you may wish to consider is that beyond a certain number of threads you can saturate the memory bus, and no increase in threads [or cores] will improve performance; it may, in fact, reduce performance.

From this question, "Atomically increment two integers with CAS", there is a link to a two-part video talk from CppCon 2015: https://www.youtube.com/watch?v=lVBvHbJsg5Y and https://www.youtube.com/watch?v=1obZeHnAwz4

They're about 1.5 hours each, but, IMO, well worth it.

In the talk, the speaker [who has done a lot of multithread/multicore optimization] says that, in his experience, the memory bus/system tends to get saturated after about four threads.



Answer 2:

Hyper-Threading is Intel's implementation of simultaneous multithreading (SMT). Current AMD processors don't implement SMT (the Bulldozer microarchitecture family has something else, which AMD calls cluster-based multithreading, but the Zen microarchitecture is supposed to have SMT). OpenMP has no built-in support for detecting SMT.

If you want a general function to detect Hyper-Threading you need to support different generations of processors and make sure that the processor is an Intel processor and not AMD. It's best to use a library for this.
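
To illustrate only the vendor check mentioned above (not the handling of different processor generations, for which a library remains the better choice), here is a sketch that decodes the CPUID leaf 0 vendor string. It assumes GCC or Clang on x86 for the <cpuid.h> header; vendor_from_regs and is_intel_cpu are names invented for this example.

```c
#include <string.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid */
#endif

/* Assemble the 12-character vendor string from CPUID leaf 0.
   The characters are stored in EBX, EDX, ECX (in that order),
   least significant byte first. */
static void vendor_from_regs(unsigned ebx, unsigned ecx, unsigned edx,
                             char out[13]) {
  const unsigned regs[3] = { ebx, edx, ecx };
  for (int r = 0; r < 3; r++)
    for (int b = 0; b < 4; b++)
      out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xff);
  out[12] = '\0';
}

/* Returns 1 for "GenuineIntel", 0 otherwise (including non-x86 builds). */
static int is_intel_cpu(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  unsigned eax, ebx, ecx, edx;
  char vendor[13];
  if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return 0;
  vendor_from_regs(ebx, ecx, edx, vendor);
  return strcmp(vendor, "GenuineIntel") == 0;
#else
  return 0;
#endif
}
```

A detection routine that relies on Intel-specific CPUID leaves could call is_intel_cpu() first and fall back to another method (such as /proc/cpuinfo) when it returns 0.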

But you can create a function using OpenMP that works for many modern Intel processors as I described here.

The following code will count the number of physical cores on modern Intel processors (it has worked on every Intel processor I have tried it on). You have to bind the threads to get this to work. With GCC you can use export OMP_PROC_BIND=true; otherwise you can bind with code (which is what I do).

Note that I am not sure this method is reliable with VirtualBox. With VirtualBox on a 4-core/8-logical-processor CPU, with Windows as host and Linux as guest, and with the number of cores for the VM set to 4, this code reports 2 cores, and /proc/cpuinfo shows that two of the cores are actually logical processors.

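#include <stdio.h>
#if defined(_MSC_VER)
#include <intrin.h>   // declares __cpuidex, used in the MSVC branch below
#endif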

//cpuid function defined in instrset_detect.cpp by Agner Fog (2014 GNU General Public License)
//http://www.agner.org/optimize/vectorclass.zip

// Define interface to cpuid instruction.
// input:  eax = functionnumber, ecx = 0
// output: eax = output[0], ebx = output[1], ecx = output[2], edx = output[3]
static inline void cpuid (int output[4], int functionnumber) {
#if defined (_MSC_VER) || defined (__INTEL_COMPILER)       // Microsoft or Intel compiler, intrin.h included

  __cpuidex(output, functionnumber, 0);                  // intrinsic function for CPUID

#elif defined(__GNUC__) || defined(__clang__)              // use inline assembly, Gnu/AT&T syntax

  int a, b, c, d;
  __asm("cpuid" : "=a"(a),"=b"(b),"=c"(c),"=d"(d) : "a"(functionnumber),"c"(0) : );
  output[0] = a;
  output[1] = b;
  output[2] = c;
  output[3] = d;

#else                                                      // unknown platform. try inline assembly with masm/intel syntax

  __asm {
    mov eax, functionnumber
      xor ecx, ecx
      cpuid;
    mov esi, output
      mov [esi],    eax
      mov [esi+4],  ebx
      mov [esi+8],  ecx
      mov [esi+12], edx
      }

  #endif
}

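int getNumCores(void) {
  //Assuming an Intel processor with CPUID leaf 11 and at most two hardware threads per core
  int cores = 0;
  //Each OpenMP thread must be bound to its own logical processor (e.g. OMP_PROC_BIND=true)
  #pragma omp parallel reduction(+:cores)
  {
    int regs[4];
    cpuid(regs,11);
    //EDX (regs[3]) holds the x2APIC ID of the logical processor this thread runs on;
    //bit 0 tells the two Hyper-Threading siblings of a core apart, so count only IDs with bit 0 clear
    if(!(regs[3]&1)) cores++;
  }
  return cores;
}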

int main(void) {
  printf("cores %d\n", getNumCores());
}
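
Once a physical core count is available, it can be passed to the num_threads() clause, as in the schedule(static) num_threads(4) experiment from the question. The following is only a sketch: sum_squares and its threads parameter are placeholders for the real loop and for whatever value a detection routine such as the one above returns.

```c
#include <stddef.h>

/* Sum of squares with an explicit thread count. `threads` would typically be
   the physical core count from a detection routine; values below 1 fall
   back to 1. Without -fopenmp the pragma is ignored and the loop runs
   serially with the same result. */
static double sum_squares(const double *x, size_t n, int threads) {
  double sum = 0.0;
  if (threads < 1) threads = 1;
  #pragma omp parallel for schedule(static) num_threads(threads) reduction(+:sum)
  for (long i = 0; i < (long)n; i++)
    sum += x[i] * x[i];
  return sum;
}
```

The num_threads() clause only affects this one parallel region, so other regions in the program keep the default thread count or whatever OMP_NUM_THREADS specifies.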