I have a OpenCL program which runs fine for small problems but when running larger problems exceeds the 8-10s time limit for running kernels on Nvidia hardware. Although I have no monitors attached to the GPU I am computing on (Nvidia GTX580), the kernel will always be terminated once it runs for around 8-10s.
The preliminary research I did on this problem indicates that the Nvidia watchdog should only enforce the time limit if a monitor is connected to the graphics card. However I do not have any monitors connected to the GPU the OpenCl is running on yet this limit is still enforced.
Is it possible to disable the Nvidia watchdog or get the drivers to recognise that no monitors are attached to the GTX580 in Mac OS X 10.7.4?
I am aware that a possible way around this is to break up into smaller kernels however due to the nature of my work I could still hit this limit when I go more fine grain.
The system I am compiling/running on is as follows :
- MacPro4,1 2 x 2.26 GHz Quad Core Intel Xeon
- Mac OS X 10.7.4
- XCode 4.3.3
- Nvidia GT120 (2 Monitors attached)
- NVidia GTX580 (Nothing attached)
For extra information when I run the NVidia device query I get the following output :
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "GeForce GTX 580"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1536 MBytes (1610285056 bytes)
(16) Multiprocessors x ( 32) CUDA Cores/MP: 512 CUDA Cores
GPU Clock rate: 1564 MHz (1.56 GHz)
Memory Clock rate: 2004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GT 120"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536543232 bytes)
( 4) Multiprocessors x ( 8) CUDA Cores/MP: 32 CUDA Cores
GPU Clock rate: 1400 MHz (1.40 GHz)
Memory Clock rate: 800 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 2, Device = GeForce GTX 580, Device = GeForce GT 120
[deviceQuery] test results...
PASSED
> exiting in 3 seconds: 3...2...1...done!
I have been looking for an answer to that question some time ago too... I never found a solution.
But this is something that shouldn't be "solved" anyway.
What would you do if a user only had one GPU? Let the users PC simply freeze for more than 10 seconds? That would be fun software to work with...
I'd split it into multiple sub-kernels OR if you have a loop in your kernel, you could also run one iteration per kernel execution. Simply save the current iterator value to a buffer and then restart from that point in the next execution.
This should be applicable to your application really well because it seems like it's not some kind of real-time application like a game.