CUDA document is not clear on how memory data changes after CUDA applications throws an exception.
For example, a kernel launch(dynamic) encountered an exception (e.g. Warp Out-of-range Address), current kernel launch will be stopped. After this point, will data (e.g. __device__ variables) on device still kept or they are removed along with the exceptions?
A concrete example would be like this:
- CPU launches a kernel
- The kernel updates the value of __device__ variableA to be 5 and then crashes
- CPU memcpy the value of variableA from device to host, what is the value the CPU gets in this case, 5 or something else?
Can someone show the rationale behind this?
The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.
This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.
Non-sticky errors are cleared automatically after they are returned by a cuda API call (with the exception of
cudaPeekAtLastError
). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of thecudaMemcpy
call to transfer variableA from device to host, so the results of thecudaMemcpy
operation are undefined and unreliable -- it is as if thecudaMemcpy
operation also failed in some unspecified way.Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.
An example of a non-sticky error might be an attempt to
cudaMalloc
more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:
non-sticky, non-cuda-context-corrupting:
sticky, cuda-context-corrupting: