When creating a CUDA event, you can optionally turn on the cudaEventBlockingSync
flag. But what is the difference between creating an event with or without the flag? I read the fine manual; it just doesn't make sense to me. What is the "calling host thread", and what "blocks" when you don't use the flag?
4.6.2.7 cudaError_t cudaEventSynchronize(cudaEvent_t event)
Blocks until the event has actually been recorded. ... Waiting for an event that was created with the cudaEventBlockingSync flag will cause the calling host thread to block until the event has actually been recorded.
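For concreteness, this is what I mean by "with or without the flag" (a minimal sketch; cudaEventCreateWithFlags is the call that accepts flags):

    cudaEvent_t evDefault, evBlocking;

    // Created without the flag (default behaviour).
    cudaEventCreate(&evDefault);

    // Created with the flag in question.
    cudaEventCreateWithFlags(&evBlocking, cudaEventBlockingSync);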
cudaEventBlockingSync determines how the host will wait for the event to happen.

When cudaEventBlockingSync is SET, the waiting host thread gives up the CPU: the OS can schedule a different thread (possibly from another process) on that core, and the host thread re-acquires the CPU at a later time. With this approach the host thread does not monopolize CPU time, and the host is free to do other work.

When cudaEventBlockingSync is NOT SET, the host thread busy-waits, i.e. it enters a check-event loop. The CPU just spins, polling for the event to occur, which usually pegs the CPU performance meter at 100%. With this approach the host thread monopolizes that CPU time.

Not setting cudaEventBlockingSync gives the minimum latency from the end of kernel execution to control returning to the thread. Which setting you want depends on what the kernel is doing, i.e. how long the event will take to happen versus how much scheduling overhead is involved in blocking and waking the thread. Not setting the flag comes at the cost of not being able to do any other CPU work (other threads) on that core while waiting for the event to occur.

When you call cudaEventSynchronize, the calling thread stops executing until the event has happened, at which point the program continues. It is a way of making sure you know the state of the running program, which is especially important in CUDA because so many things are asynchronous.
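To make that concrete, here is a minimal sketch of the usual pattern (myKernel, grid, block and d_data are placeholders, not anything from your code):

    cudaEvent_t done;

    // With the flag, the cudaEventSynchronize() call below yields the CPU;
    // create the event with cudaEventCreate(&done) instead and it will spin.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    myKernel<<<grid, block>>>(d_data);   // asynchronous: control returns to the host immediately
    cudaEventRecord(done, 0);            // enqueue the event after the kernel in stream 0

    cudaEventSynchronize(done);          // the calling host thread waits here until the event is recorded
    cudaEventDestroy(done);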
The "calling host thread" is the thread that is running on the CPU of the host computer in which the CUDA device resides.
Edit in response to a comment below:
I believe that the difference between a "blocking sync" and a regular sync is that the thread blocks and will not run until the event is completed, as opposed to a thread that "spins" as it waits, constantly checking the value. This means that the thread will not use any extra CPU time spinning, but will instead be awakened once the event is completed. This is useful if, say, you're running this program on a server where CPU time is at a premium or you have to pay per unit time.
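As a rough sketch of why that matters: with a blocking-sync event the waiting thread sleeps inside cudaEventSynchronize(), so another host thread can use that core in the meantime (doOtherCpuWork() is just a placeholder for any CPU-side job):

    #include <thread>

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    myKernel<<<grid, block>>>(d_data);
    cudaEventRecord(done, 0);

    std::thread worker(doOtherCpuWork);  // CPU work proceeds on another thread
    cudaEventSynchronize(done);          // this thread sleeps instead of spinning on a core
    worker.join();

    cudaEventDestroy(done);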