I am trying to optimize a PyOpenCL program. For this reason I was wondering whether there is a way to profile the program and see where most of the time is spent.
Do you have any idea how to approach this problem?
Thanks in advance
Andi
EDIT: For example, Nvidia's nvprof would do the trick for CUDA and PyCUDA; however, it does not work for PyOpenCL.
Yes, there absolutely is: you can profile individual PyOpenCL events run on the device, and you can also profile the overall program on the host.
PyOpenCL events are returned when copying memory to the device, running a kernel on the device, and copying memory back from the device.
Here is an example of profiling a Device event:
event = cl.enqueue_copy(queue, np_array, cl_array)
event.wait()  # profiling info is only valid once the event has completed
print((event.profile.end - event.profile.start) * 1e-9)  # seconds; profile times are in nanoseconds
Here is an example of profiling on the host:
from time import time, strftime, gmtime
start_time = time()
# ... do some stuff like the above ^
end_time = time()
print(strftime('%H:%M:%S', gmtime(end_time - start_time)))
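Note that the strftime/gmtime formatting truncates to whole seconds, so for short-running sections time.perf_counter (Python 3.3+) is a better fit. A minimal sketch, where the summation merely stands in for the OpenCL work you want to time:

```python
from time import perf_counter

start = perf_counter()
# stand-in for the enqueue/wait work you actually want to measure
total = sum(i * i for i in range(100_000))
elapsed = perf_counter() - start

print("host-side elapsed: %.6f s" % elapsed)
```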
I haven't seen a more comprehensive way to profile a PyOpenCL program. Hope that helps though!
OK,
I have figured out a way: the CUDA Toolkit 3.1 offers profiling for OpenCL (later versions do not). From this package, use the Compute Visual Profiler (computeprof.exe). It is available for Windows and Linux here and can be installed alongside a newer CUDA Toolkit.
I hope this helps someone else too.
Basically, Nvidia's Visual Profiler (nvvp) used to work for profiling OpenCL (even through PyOpenCL), but Nvidia stopped updating it. There's a neat trick I pulled from the PyOpenCL mailing list and got to work with nvvp, using the information from here.
The basic steps are:
- Create a nvvp.cfg file with the configuration for the visual profiler. Example:
profilelogformat CSV
streamid
gpustarttimestamp
gpuendtimestamp
gridsize
threadblocksize
dynsmemperblock
stasmemperblock
regperthread
memtransfersize
- Create a bash script that sets the environment variables and launches the Python / OpenCL / PyOpenCL process. Example:
#!/bin/bash
export {CL_,COMPUTE_}PROFILE=1
export COMPUTE_PROFILE_CONFIG=nvvp.cfg
python OpenCL_test.py
This will put a log file in your working directory, which you can inspect. You can import this file into nvvp if you change every occurrence of "OPENCL_" to "CUDA_". For further information, follow the provided link.
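To automate the "OPENCL_" to "CUDA_" rewrite, a small helper like the following can patch the log in place (the function name is illustrative; a shell one-liner such as `sed 's/OPENCL_/CUDA_/g'` does the same thing):

```python
from pathlib import Path

def opencl_log_to_cuda(path):
    """Rewrite a COMPUTE_PROFILE log in place so nvvp accepts it,
    replacing every occurrence of "OPENCL_" with "CUDA_"."""
    p = Path(path)
    p.write_text(p.read_text().replace("OPENCL_", "CUDA_"))
```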
In addition to benshope's answer, you should enable profiling of the command queue by creating it via
queue = cl.CommandQueue(context,
properties=cl.command_queue_properties.PROFILING_ENABLE)
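Putting it together, here is a minimal sketch of timing a copy with a profiling-enabled queue (the array size and variable names are illustrative; it assumes PyOpenCL and at least one OpenCL platform are installed, and skips the device part otherwise):

```python
def event_seconds(ev):
    """Elapsed time of a finished event in seconds, computed from the
    .profile.start / .profile.end timestamps (which are in nanoseconds)."""
    return (ev.profile.end - ev.profile.start) * 1e-9

try:
    import numpy as np
    import pyopencl as cl
    have_cl = bool(cl.get_platforms())
except Exception:
    have_cl = False  # no PyOpenCL / no OpenCL platform available

if have_cl:
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    host_array = np.arange(1 << 20, dtype=np.float32)
    dev_buffer = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, host_array.nbytes)

    ev = cl.enqueue_copy(queue, dev_buffer, host_array)  # host -> device
    ev.wait()  # timestamps are only valid after the event has completed
    print("copy took %.6f s" % event_seconds(ev))
```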
The PyOpenCL examples contain benchmarking scripts that do some basic profiling (check benchmark.py, dump-performance.py and transpose.py).
CodeXL from AMD works very well.