Reading updated memory from other CUDA stream

2019-09-02 12:47发布

问题:

I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following.

#include <iostream>                                                              
#include <cuda.h>                                                                
#include <cuda_runtime.h>                                                        

#define FLAGCLEAR 0                                                              
#define FLAGSET   1                                                              

using namespace std;                                                             

__global__ void set_flag(int *flag)                                              
{                                                                                
    *flag = FLAGSET;                                                             

    // Wait for flag to reset.                                                   
    while (*flag == FLAGSET);                                                    
}                                                                                

__global__ void read_flag(int *flag)                                             
{                                                                                
    // wait for the flag to set.                                                 
    while (*flag != FLAGSET);                                                    

    // Clear it for next time.                                                   
    *flag = FLAGCLEAR;                                                           
}                                                                                

int main(void)                                                                   
{                                                                                
    // Setup memory for flag                                                     
    int *flag;                                                                   
    cudaMalloc(&flag, sizeof(int));                                              

    // Setup streams                                                             
    cudaStream_t stream0, stream1;                                               
    cudaStreamCreate(&stream0);                                                  
    cudaStreamCreate(&stream1);                                                  

    // Print something to let me know that we started.                           
    cout << "Starting the flagging" << endl;                                     

    // do the flag test                                                          
    set_flag  <<<1,1,0,stream0>>>(flag);                                         
    read_flag <<<1,1,0,stream1>>>(flag);                                         

    // Wait for the streams                                                      
    cudaDeviceSynchronize();                                                     

    // Getting here is a painful process!
    cout << "Finished the flagging" << endl;                                     

    // Clean UP!                                                                 
    cudaStreamDestroy(stream0);                                                  
    cudaStreamDestroy(stream1);                                                  
    cudaFree(flag);                                                              

}

I eventually get the second printout, but only after the computer freezes for 15 seconds, and I get both printouts at the same time. These streams are supposed to run in parallel, and not bog the system down. What am I doing wrong? How can I fix this?

Thanks.

EDIT

It seems as though a special case has been solved by adding volitile but now something else has broken. If I add anything between the two kernel calls, the system reverts back to the old behavior, namely freezing and printing everything at once. This behavior is shown by adding sleep(2); between set_flag and read_flag. Also, when put in another program, this causes the GPU to lock up. What am I doing wrong now?

Thanks again.

回答1:

The compiler is allowed to do fairly aggressive optimization. Furthermore, the L1 caches on Fermi devices are not guaranteed to be coherent. To work around these issues, try adding the volatile keyword to your functions usage of the flag variable like so:

__global__ void set_flag(volatile int *flag)       

and

__global__ void read_flag(volatile int *flag)     

Generally speaking, when used on a variable resident in global memory, this will cause the compiler to issue loads that bypass the L1 cache and will also generally speaking prevent optimizations of these variables into registers, for example.

I think you'll have better results.

The code you've posted has the potential to deadlock due to these issues. Therefore, the observation you're seeing may actually be the OS (e.g. windows TDR) interrupting your program.