How to offload particular thread of a single app t

Posted 2019-08-28 17:31

Question:

Suppose I have a single C/C++ app running on the host. A few threads run on the host CPU and 50 threads run on the Xeon Phi cores.

How can I make sure that each of these 50 threads runs on its own Xeon Phi core and is never evicted from that core's cache (given that the code is small enough)?

Could someone please outline a very general idea of how to do this, and say which tool/API would be most suitable (for C/C++ code)?

What is the fastest way to exchange data between the host thread-aggregator and the 50 Phi threads?

Given that the actual parallelism will be very limited, this application is going to be more like a plain 51-thread application with some basic multithreaded data synchronization.

Can I use a conventional C/C++ compiler to create an app like this?

Answer 1:

You have raised several questions:

  1. Yes, you can take a conventional C program and compile it with the regular Intel C/C++/Fortran compilers (known as Intel Composer XE) to generate a binary that runs on the Intel Xeon Phi coprocessor in "native"/"symmetric" or "offload" mode. In the simplest case, you just recompile your C/C++ program with -mmic and run it "natively" on the Phi as is.

  2. Which API to use? Use the OpenMP 4.0 standard or the Intel Cilk Plus programming model (essentially a set of pragmas or keywords applicable to C/C++). OpenCL, Intel TBB and likely OpenACC are also possible, but OpenMP and Cilk Plus can express threading, vectorization and offload (the three things essential for Xeon Phi programming) without refactoring or rewriting a "conventional" C/C++/Fortran program.

  3. Thread pinning: this can be achieved via OpenMP affinity (see the details on MIC_KMP_AFFINITY below) or the Intel TBB affinity facilities.

  4. The fastest way to exchange data between the host and the target Phi is to avoid any exchange, for example by using the MPI symmetric approach. However, you seem to be asking about the "offload" programming model specifically; there, asynchronous offload gives the best performance. Synchronous offload is simpler to program, but worse in terms of achievable performance.

Overall, your questions are quite general, so I would recommend starting from the very beginning, i.e. with the following ~10-page Dr. Dobb's manual or the given Intel intro document.


Thread pinning is a more advanced topic, and it seems to be the one that interests you most, so I will explain it in more detail:

  • If your code is parallelized using the OpenMP 4.0 standard, you can achieve the desired behavior using MIC_KMP_AFFINITY / MIC_KMP_PLACE_THREADS for the Xeon Phi and KMP_AFFINITY / KMP_PLACE_THREADS for the host CPU.
  • Quite likely you are looking for this specific setting: MIC_KMP_PLACE_THREADS=50c,1t (that is, 50 cores with 1 thread per core).
  • I've seen people mention PHI_KMP_AFFINITY instead of MIC_KMP_AFFINITY. I believe they are aliases, but I haven't tried it myself.
  • Using 50 threads on Xeon Phi is usually not the best idea. It is better to try around 120 threads or so.
  • More details about affinity on Xeon Phi are explained in these 3 articles: http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML#id-1.6.2.3 and https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture and https://software.intel.com/en-us/articles/openmp-thread-affinity-control