CUDA streams destruction and cudaDeviceReset

Posted 2020-02-26 15:10

Question:

I have implemented the following class using CUDA streams

class CudaStreams
{
    private:
        int             nStreams_;
        cudaStream_t*   streams_;
        cudaStream_t    active_stream_;

    public:

        // default constructor
        CudaStreams() { }

        // streams initialization
        void InitStreams(const int nStreams = 1) {
            nStreams_ = nStreams;
            // allocate and initialize an array of stream handles
            streams_ = (cudaStream_t*) malloc(nStreams_*sizeof(cudaStream_t));
            for(int i = 0; i < nStreams_; i++) CudaSafeCall(cudaStreamCreate(&(streams_[i]))); 

            active_stream_ = streams_[0];}

        // default destructor
        ~CudaStreams() {     
            for(int i = 0; i<nStreams_; i++) CudaSafeCall(cudaStreamDestroy(streams_[i])); }

}; 

If I now run this simple code

void main( int argc, char** argv) 
{
    streams.InitStreams(1);
    streams.~CudaStreams();

    cudaDeviceReset();
}

after the cudaDeviceReset() call, I receive the following message:

Unhandled exception 0x772f15de in test.exe: 0x00000000.

What should I do before invoking the destructor to avoid this issue when using cudaDeviceReset()?

EDIT

If I add free(streams_); in the destructor, namely

~CudaStreams() {     
    for(int i = 0; i<nStreams_; i++) CudaSafeCall(cudaStreamDestroy(streams_[i])); // * 
    free(streams_); }

I receive the following error message

cudaSafeCall() failed at C:\Users\Documents\Project\Library\CudaStreams.cuh:79 : unknown error

where line 79 is the one marked with * in the destructor.

Furthermore, if I use the same statements as InitStreams() and the destructor directly inside main(), namely

void main( int argc, char** argv) 
{
    int nStreams_ = 3;
    cudaStream_t* streams_ = (cudaStream_t*) malloc(nStreams_*sizeof(cudaStream_t));
    for(int i = 0; i < nStreams_; i++) CudaSafeCall(cudaStreamCreate(&(streams_[i]))); 
    for(int i = 0; i<nStreams_; i++) CudaSafeCall(cudaStreamDestroy(streams_[i])); 
    free(streams_);

    cudaDeviceReset();
}

everything works well. Perhaps it is something connected to a bad use of the class?

Answer 1:

There are two problems here, both related to the destructor of your class and scope.

Firstly, let's start with a version of your main() which will work correctly:

int main( int argc, char** argv) 
{
    {
        CudaStreams streams;
        streams.InitStreams(1);
    }

    cudaDeviceReset();

    return 0;
}

This works correctly because the destructor for streams is called exactly once (when streams falls out of scope), and before cudaDeviceReset is called.

Your original main() (or a compilable version of it, but more about that later...) fails for two reasons. Let's look at it again:

int main( int argc, char** argv) 
{
    CudaStreams streams;
    streams.InitStreams(1);
    streams.~CudaStreams();

    cudaDeviceReset();

    return 0;
}

Here you explicitly call the destructor for streams (which you should almost never do), then call cudaDeviceReset, and then the destructor is called a second time at the return statement, when streams goes out of scope. That automatic destructor call, made after the context has been destroyed, is the source of the segfault/exception: the cudaStreamDestroy calls are trying to operate on streams without a valid CUDA context. So the solution is to make sure that no object whose destructor makes CUDA API calls goes out of scope (or has its destructor called explicitly) when there is no context.
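The ordering argument can be checked without a GPU. The following is only a model, not CUDA code: DeviceReset, Streams, context_alive and event_log are hypothetical stand-ins for cudaDeviceReset, the CudaStreams class, and the state of the CUDA context. The explicit destructor call is deliberately left out, so the sketch shows only when the automatic destructor runs relative to the reset.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the CUDA runtime; none of these names are real CUDA API.
namespace model {

bool context_alive = true;            // models "a valid CUDA context exists"
std::vector<std::string> event_log;   // records what happened, in order

// Models cudaDeviceReset(): tears the context down.
void DeviceReset() {
    context_alive = false;
    event_log.push_back("reset");
}

// Models CudaStreams: the destructor needs a live context to succeed.
struct Streams {
    ~Streams() {
        event_log.push_back(context_alive ? "destroy ok"
                                          : "destroy without context");
    }
};

// Mirrors the working main(): the inner scope ends before the reset.
std::vector<std::string> ScopedOrder() {
    context_alive = true;
    event_log.clear();
    {
        Streams streams;
    }  // ~Streams runs here, while the context is still alive
    DeviceReset();
    return event_log;
}

// Mirrors the failing main(): the object outlives the reset.
std::vector<std::string> UnscopedOrder() {
    context_alive = true;
    event_log.clear();
    {
        Streams streams;
        DeviceReset();
    }  // ~Streams runs here, after the context is gone
    return event_log;
}

}  // namespace model
```

ScopedOrder() logs the destruction before the reset, while UnscopedOrder() logs the reset first and then a destruction with no context, which corresponds to the case that crashes in the real program.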

If we made a third version like this:

int main( int argc, char** argv) 
{
    {
        CudaStreams streams;
        streams.InitStreams(1);
        streams.~CudaStreams();
    }

    cudaDeviceReset();

    return 0;
}

You will get a CUDA runtime error, because the destructor gets called twice. The first (explicit) call will work. The second (implicit, at the end of the scope) will produce a runtime error: you have a valid context, but are now trying to destroy non-existent streams.
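If early cleanup is really needed, one common way to avoid the double destruction is to give the class an explicit release method that is safe to call more than once, and to have the destructor call the same method. The sketch below is one possible shape, not the original class: HandlePool, Release, and the create/destroy callables are hypothetical names, and the handle type is generic so that the code compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical generic owner of N handles; Handle would be cudaStream_t in real use.
template <typename Handle>
class HandlePool {
public:
    using CreateFn  = std::function<Handle()>;
    using DestroyFn = std::function<void(Handle)>;

    HandlePool(std::size_t n, CreateFn create, DestroyFn destroy)
        : destroy_(std::move(destroy)) {
        handles_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) handles_.push_back(create());
    }

    // Copying would create two owners that both destroy the same handles.
    HandlePool(const HandlePool&) = delete;
    HandlePool& operator=(const HandlePool&) = delete;

    // Safe to call repeatedly: after the first call there is nothing left to destroy.
    void Release() {
        for (Handle h : handles_) destroy_(h);
        handles_.clear();
    }

    // The destructor reuses Release(), so an earlier manual Release() is harmless.
    ~HandlePool() { Release(); }

    std::size_t size() const { return handles_.size(); }

private:
    std::vector<Handle> handles_;  // owned handles; emptied by Release()
    DestroyFn destroy_;
};
```

With CUDA, the create callable would wrap cudaStreamCreate and the destroy callable would wrap cudaStreamDestroy. Calling Release() before cudaDeviceReset() then takes the place of the explicit destructor call, and the automatic destructor that runs later finds an empty pool and makes no CUDA API calls.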

As a final comment/question: How hard would it have been to post an actual compilable version of the code you showed in your original question? It literally required 5 extra lines to make it into a proper repro case someone else could actually compile and run. I find it a bit unreasonable to expect others to make an effort to answer what are basically debugging questions if you are not willing to make a similar effort in providing useful code and information which makes everyone's life that much easier. Think about it. [end of rant]