Does early exiting a thread disrupt synchronization among CUDA threads in a block? [duplicate]

This question already has an answer here:

Can I use __syncthreads() after having dropped threads?

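I am implementing an image processing algorithm with CUDA and have some questions about thread synchronization.

The problem can be described as follows:

We have an image of size W * H. For each pixel of the image I need to run 9 identical data-parallel processes, and each process produces an array of values as its result (the arrays have the same length throughout the algorithm, say N, which is around 20 or 30). For each pixel, once the 9 processes finish their calculations they accumulate their results into a final array (a single array per pixel).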

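To parallelise this, I designed the following structure: I launch blocks with dimensions (10,10,9), which means each thread block processes a 10*10 sub-image and each thread handles 1 of the 9 identical processes for a single pixel. The grid dimensions are (W/10, H/10, 1) in this case. For each thread block I allocate a shared memory array of length 100*N, and each thread writes to the appropriate shared memory location according to the coordinates of its pixel. So I need synchronization with atomicAdd and __syncthreads() here.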
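The launch geometry described above can be sketched on the host side as follows. This is only an illustration under stated assumptions: the image size `W = 645`, `H = 480` and the helper `divUp` are hypothetical, and N is fixed at 22 to match the code in the question. Note that the sketch rounds the grid up rather than using a plain W/10, so images whose size is not a multiple of 10 are still fully covered; in that case the kernel needs a bounds check for the threads that fall outside the image.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE  = 10;  // each block covers a 10x10 sub-image
constexpr int PROCS = 9;   // 9 processes per pixel, mapped to threadIdx.z
constexpr int N     = 22;  // result length per pixel (assumed, as in the question's code)

// Hypothetical helper: integer division rounded up.
inline int divUp(int a, int b) { return (a + b - 1) / b; }

int main() {
    const int W = 645, H = 480;  // hypothetical image size, not from the question

    dim3 block(TILE, TILE, PROCS);                 // 10 * 10 * 9 = 900 threads per block
    dim3 grid(divUp(W, TILE), divUp(H, TILE), 1);  // rounded up to cover edge pixels

    // One float accumulator array of length N per pixel in the tile.
    size_t sharedBytes = size_t(TILE) * TILE * N * sizeof(float);

    printf("block = (%u,%u,%u), grid = (%u,%u,%u), shared = %zu bytes\n",
           block.x, block.y, block.z, grid.x, grid.y, grid.z, sharedBytes);
    return 0;
}
```

With 900 threads per block this configuration stays under the 1024-thread block limit of current GPUs, and the shared array is well within the default 48 KB per block.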

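The problem is that if a pixel has a value of zero, we do not need to process it at all, so I want to exit early for such pixels; otherwise I would do unnecessary work, since a large portion of the image consists of zeros (background). So I thought of writing something like the following: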

//X and Y are the coordinates of the current pixel in the input image.
//threadIdx.z gives the index of the process among the 9 for the current pixel. 

int X=blockIdx.x * blockDim.x + threadIdx.x;
int Y=blockIdx.y * blockDim.y + threadIdx.y;
int numOfProcessForTheCurrPixel=threadIdx.z;
int linearIndexOfPixelInBlock=threadIdx.y * blockDim.x + threadIdx.x;

unsigned short pixelValue=tex2D(image,X,Y);
//Here, threads processing zero-pixels will exit immediately.
if(pixelValue==0)
    return;

float resultArray[22];
//Fill the result array according to our algorithm, mostly irrelevant stuff.
ProcessPixel(resultArray,X,Y,numOfProcessForTheCurrPixel);

for(int i=0;i<22;i++)
    atomicAdd(&__sharedMemoryArray[22*linearIndexOfPixelInBlock + i],resultArray[i]);

__syncthreads();
//Then copy from the shared to the global memory and etc.

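What concerns me here is this statement from the programming guide:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

So in my case, if some of the pixels in a 10*10 thread block are zero and some are not, the threads belonging to the zero pixels will exit immediately at the beginning while the other threads continue processing. What happens to the synchronization in this case? Will it still work properly, or will it produce undefined behavior as the programming guide says? I thought of making the zero-pixel threads process garbage data to keep them busy, but that would increase the processing time unnecessarily for blocks that consist entirely of zeros (and we have those very often). What can be done in this case?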

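To avoid creating a deadlock, all threads need to hit the __syncthreads() unconditionally. In your example you can do that by replacing the return with an if statement that skips over most of the function and heads straight for the __syncthreads() in the zero-pixel case.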

unsigned short pixelValue=tex2D(image,X,Y);
//If there's nothing to compute, jump over all the computation stuff
if(pixelValue!=0)
{

    float resultArray[22];
    //Fill the result array according to our algorithm, mostly irrelevant stuff.
    ProcessPixel(resultArray,X,Y,numOfProcessForTheCurrPixel);

    for(int i=0;i<22;i++)
        atomicAdd(&__sharedMemoryArray[22*linearIndexOfPixelInBlock + i],resultArray[i]);

}

__syncthreads(); 

if (pixelValue != 0)
{
    //Then copy from the shared to the global memory and etc. 
}
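For reference, here is a self-contained sketch of how the pattern above might look as a complete program. Several things in it are assumptions and not part of the original code: the image is read from plain global memory instead of `tex2D`, `processPixelStub` is a hypothetical stand-in for `ProcessPixel` that simply writes 1.0f, the shared accumulator is zeroed before use (followed by its own barrier), and an `active` flag also covers threads that fall outside the image. The key point is unchanged: no thread returns before a barrier, so every thread in the block reaches both `__syncthreads()` calls.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 22;     // result length per pixel, as in the question's code
constexpr int TILE = 10;  // 10x10 pixels per block
constexpr int PROCS = 9;  // 9 processes per pixel

// Hypothetical stand-in for ProcessPixel: every process contributes 1.0f per entry.
__device__ void processPixelStub(float* result, int, int, int) {
    for (int i = 0; i < N; ++i) result[i] = 1.0f;
}

// Assumes a launch with blockDim == (TILE, TILE, PROCS).
__global__ void accumulateKernel(const unsigned short* image, float* out, int W, int H) {
    __shared__ float acc[TILE * TILE * N];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int proc = threadIdx.z;
    int pixelInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    int nThreads = blockDim.x * blockDim.y * blockDim.z;

    // Zero the accumulator cooperatively; all threads take part and reach the barrier.
    for (int i = tid; i < TILE * TILE * N; i += nThreads) acc[i] = 0.0f;
    __syncthreads();

    // Threads for zero pixels or outside the image skip the work but do not return.
    bool active = (x < W) && (y < H) && (image[y * W + x] != 0);
    if (active) {
        float result[N];
        processPixelStub(result, x, y, proc);
        for (int i = 0; i < N; ++i)
            atomicAdd(&acc[N * pixelInBlock + i], result[i]);
    }
    __syncthreads();  // reached unconditionally by every thread in the block

    // One thread per active pixel copies the accumulated array to global memory.
    if (active && proc == 0)
        for (int i = 0; i < N; ++i)
            out[(y * W + x) * N + i] = acc[N * pixelInBlock + i];
}

int runDemo() {
    const int W = 25, H = 15;  // hypothetical size, deliberately not a multiple of 10
    std::vector<unsigned short> img(W * H);
    for (int i = 0; i < W * H; ++i) img[i] = (i % 3 == 0) ? 0 : 1;  // mix zeros and nonzeros

    unsigned short* dImg = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dImg, img.size() * sizeof(unsigned short));
    cudaMalloc(&dOut, img.size() * N * sizeof(float));
    cudaMemcpy(dImg, img.data(), img.size() * sizeof(unsigned short), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, img.size() * N * sizeof(float));

    dim3 block(TILE, TILE, PROCS);
    dim3 grid((W + TILE - 1) / TILE, (H + TILE - 1) / TILE, 1);
    accumulateKernel<<<grid, block>>>(dImg, dOut, W, H);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    std::vector<float> out(img.size() * N);
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dOut);

    // Under the stub, a nonzero pixel should accumulate PROCS per entry, a zero pixel nothing.
    for (int p = 0; p < W * H; ++p) {
        float expected = img[p] ? float(PROCS) : 0.0f;
        for (int i = 0; i < N; ++i)
            if (out[p * N + i] != expected) return 1;
    }
    return 0;
}

int main() {
    int rc = runDemo();
    printf(rc == 0 ? "ok\n" : "mismatch\n");
    return rc;
}
```

Blocks made up entirely of zero pixels still pay for the zeroing loop and the two barriers in this sketch, but they skip the per-pixel computation and the atomic adds, which is where most of the time would otherwise go.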