I had a quick look on the forums and I don't think this question has been asked already.
I am currently working with an MPI/CUDA hybrid code, made by somebody else during his PhD. Each CPU has its own GPU. My task is to gather data by running the (already working) code, and implement extra things. Turning this code into a single CPU / Multi-GPU one is not an option at the moment (later, possibly.).
I would like to make use of performance profiling tools to analyse the whole thing.
For now an idea is to have each CPU launch nvvp for its own GPU and gather data, while another profiling tool will take care of general CPU/MPI part (I plan to use TAU, as I usually do).
Problem is, launching nvvp's interface 8 simultaneous times (if running with 8 CPU/GPUs) is extremely annoying. I would like to avoid going through the interface, and get a command line that directly writes the data in a file, that I can feed to nvvc's interface later and analyse.
I'd like to get a command line that will be executed by each CPU and will produce for each of them a file giving data about their own GPU. 8 (GPUs/CPUs) = 8 files. Then I plan to individually feed and analyse these files with nvcc one by one, comparing the data manually.
Any idea ?
Thanks !
Another option is since you are already using TAU to profile the CPU side of the application you could also use TAU to collect the GPU performance data. TAU supports multi-gpu execution along with MPI, take a look at http://www.nic.uoregon.edu/tau-wiki/Guide:TAUGPU for instructions on how to get started using TAU's GPU profiling capabilites. TAU uses CUPTI (CUda Performance Tools Interface) underneath and so the data you will be able to collect with TAU will be very similar to what to can collect with nVidia's Visual Profiler.
Apparently since 2015 it is possible to auto-annotated MPI calls via NVTX and mpi_interceptions.so library when using nvprof profiler:
https://devblogs.nvidia.com/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/
http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf
TAO still does not support distributed deep learning according to this presentation:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7684-allen-malony-performance-analysis-of-cuda-deep-learning-networks-using-tau.pdf
Things have changed since CUDA 5.0 and now we can simply use
%h
,%p
and%q{ENV}
as mentioned here instead of using a wrapper script:Take a look at
nvprof
, part of the CUDA 5.0 Toolkit (currently available as a release candidate). There are some limitations - it can only collect a limited number of counters in a given pass and it cannot collect metrics (so for now you'd have to script multiple launches if you want more than a few events). You can get more information from the nvvp built-in help, including an example MPI launch script (copied here but I suggest you check out the nvvp help for an up-to-date version if you have anything newer than the 5.0 RC).