Is there a way to profile an OpenCL or a PyOpenCL program?

Posted 2019-04-11 03:20

Question:

I am trying to optimize a PyOpenCL program. For this reason I was wondering if there is a way to profile the program and see where most of the time is spent.

Do you have any idea how to approach this problem?

Thanks in advance
Andi

EDIT: For example, Nvidia's nvprof would do the trick for CUDA and PyCUDA, but not for PyOpenCL.

Answer 1:

Yes, there absolutely is - you can profile the individual PyOpenCL events run on the Device, and you can also profile the overall program on the Host.

PyOpenCL events are returned by copying memory to the device, running a kernel on the device, and copying memory back from the device.

Here is an example of profiling a Device event:

# Note: event.profile is only populated if the queue was created with
# PROFILING_ENABLE (see Answer 4); the timestamps are in nanoseconds.
event = cl.enqueue_copy(queue, np_array, cl_array)
event.wait()
print((event.profile.end - event.profile.start) * 1e-9)
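
Kernel launches return a profileable event in the same way. Here is a minimal sketch; the program object, the kernel name my_kernel, and its single buffer argument are hypothetical, and the queue must again have profiling enabled:

# Assumes `program` was built from source containing a kernel named `my_kernel`
# and that `queue` was created with PROFILING_ENABLE.
kernel_event = program.my_kernel(queue, np_array.shape, None, cl_array)
kernel_event.wait()
print((kernel_event.profile.end - kernel_event.profile.start) * 1e-9)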

Here is an example of profiling on the host:

from time import time, strftime, gmtime
start_time = time()
# ... do some stuff like the above ^
end_time = time()
print(strftime('%H:%M:%S', gmtime(end_time - start_time)))

I haven't seen a more comprehensive way to profile a PyOpenCL program. Hope that helps though!



Answer 2:

OK,
I have figured out a way: CUDA Toolkit 3.1 offers profiling for OpenCL (later versions do not). From that package, use the Compute Visual Profiler (computeprof.exe). It is available for Windows and Linux here and can be installed alongside a newer CUDA Toolkit.

I hope this helps someone else too.



Answer 3:

Basically, Nvidia's Visual Profiler (nvvp) used to work for profiling OpenCL (even through PyOpenCL), but Nvidia stopped updating that support. There's a neat trick I pulled from the PyOpenCL mailing list and got working with nvvp, using the information from here.

The basic steps are:

  1. Create an nvvp.cfg file with the configuration for the visual profiler.

Example:

profilelogformat CSV
streamid
gpustarttimestamp
gpuendtimestamp
gridsize
threadblocksize
dynsmemperblock
stasmemperblock
regperthread
memtransfersize
  2. Create a bash script to set the environment variables and launch the Python / OpenCL / PyOpenCL process.

Example:

#!/bin/bash
export {CL_,COMPUTE_}PROFILE=1
export COMPUTE_PROFILE_CONFIG=nvvp.cfg
python OpenCL_test.py

This will put a log file in your working directory, which you can inspect. You can import this file into nvvp if you change every occurrence of "OPENCL_" to "CUDA_". For further information, follow the provided link.
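
If you want to script that substitution, here is a small sketch in Python; the input file name opencl_profile_0.log is an assumption, so use whatever log file your run actually produced:

# Rewrite the OpenCL profiler log so nvvp accepts it as a CUDA log.
# The input file name below is an assumption; adjust it to the log your run produced.
with open("opencl_profile_0.log") as src:
    log = src.read()
with open("cuda_profile_0.log", "w") as dst:
    dst.write(log.replace("OPENCL_", "CUDA_"))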



Answer 4:

In addition to benshope's answer, you should enable profiling of the command queue by creating it via

queue = cl.CommandQueue(context, 
            properties=cl.command_queue_properties.PROFILING_ENABLE)
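
Putting this together with the event timing from Answer 1, here is a minimal end-to-end sketch; the buffer size and variable names are placeholders:

import numpy as np
import pyopencl as cl

context = cl.create_some_context()
queue = cl.CommandQueue(context,
            properties=cl.command_queue_properties.PROFILING_ENABLE)

np_array = np.zeros(1024, dtype=np.float32)
cl_array = cl.Buffer(context, cl.mem_flags.READ_WRITE, np_array.nbytes)

# The copy returns an event whose profile attribute is populated
# because the queue was created with PROFILING_ENABLE.
event = cl.enqueue_copy(queue, cl_array, np_array)
event.wait()
print((event.profile.end - event.profile.start) * 1e-9)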

The PyOpenCL examples contain benchmarking scripts doing some basic profiling (check benchmark.py, dump-performance.py and transpose.py).



Answer 5:

CodeXL from AMD works very well.