I'll first do some contextualization. I'm trying to implement a non-blocking work stealing method using deques in CUDA. The deques (aDeques) are in a block-segmented array in global memory and the popWork() device function has the objective of popping work to feed threads. In addition of the global deques, each block has a stack in shared memory (aLocalStack) where it can locally work. The pop occurs in 3 levels. First attempt is in the shared stack, second attempt is in the deque owned by the block and third attempt is work steal other deques. Each deque has global bottom and pop pointers that lie in a global memory arrays (aiDequesBottoms and auiDequesAges). My problem is that when a block changes a global deque pointer, the changes aaren't being visible by other blocks when I test code in a GTS450. It seems like cache is not being updated. I have also tested in a GT520 card, where the problem does not occur. I have experienced similar problems with the aiDequeFlags array. These problems are solved by declaring it volatile. Unfortunatelly, I can't do the same to the deque pointer arrays, since I need to use atomic functions on them later. I'm sorry to not put the problem in a simpler example, but I couldn't reproduce this behavior. This first snippet has the popWork() interface explained .
template <int iDequeSize> //Size of each segment in aDeques
bool __inline__ __device__ popWork(
volatile int *aiDequeFlags , //Flags that indicates if a deque is active (has work)
int *aiDequesBottoms , //Deque bottom pointers
unsigned int *auiDequesAges , //Deque top pointers (29 higher bits) +
//Tag bits(3 lower bits).
const Int2Array *aDeques , //Deques (Int2Array is an interface for 2 int arrays)
int &uiStackBot , //Shared memory stack pointer
int2 *aLocalStack , //Shared memory local stack
const int &iTid , //threadIdx.x
const int &iBid , //blockIdx.x
//All other parameters are output
unsigned int &uiPopDequeIdx , //Choosen deque for pop
int2 *popStartIdxAndSize , //Arrays of pop start index and sizes
bool *bPopFlag , //Array of flags for pop in each level
unsigned int &uiActiveDequesIdx , //Flag to indicate pop failed (no more work)
int2 &work //Actual acquired thread work)
This second snippet has the entire function. The kernel that uses the function was launched with 8 blocks, 64 threads and in the beginning just deque 0 has 1 work, while all other deques are empty. There are some debug printf calls to generate a log, which will be show in the next snippet.
template <int iDequeSize>
bool __inline__ __device__ popWork(volatile int *aiDequeFlags , int *aiDequesBottoms , unsigned int *auiDequesAges ,
const Int2Array *aDeques , int &uiStackBot , int2 *aLocalStack , const int &iTid , const int &iBid ,
unsigned int &uiPopDequeIdx , int2 *popStartIdxAndSize , bool *bPopFlag , unsigned int &uiActiveDequesIdx , int2 &work)
{
//Pop from local stack
if(iTid == 0)
{
unsigned int uiAge = 0;
bPopFlag[0] = popBottom(uiStackBot , uiAge , popStartIdxAndSize[iBid]);
bPopFlag[3] = bPopFlag[0];
}
__syncthreads();
if(bPopFlag[0])
{
if(iTid < popStartIdxAndSize[iBid].y)
{
work = aLocalStack[popStartIdxAndSize[iBid].x + iTid];
}
}
else
{
if(iTid == 0)
{ //Try to pop from block deque
bPopFlag[1] = popBottom(aiDequesBottoms[iBid] , auiDequesAges[iBid] , popStartIdxAndSize[iBid]);
if(bPopFlag[1])
{
uiPopDequeIdx = iBid;
//Debug
if(iBid == 0)
{
printf("Block %d pop global deque. Bottom=%d\n" , iBid , aiDequesBottoms[iBid]);
}
//
}
else
{
aiDequeFlags[iBid] = 0;
popStartIdxAndSize[iBid].x = INFTY;
uiPopDequeIdx = INFTY;
}
bPopFlag[3] = bPopFlag[1];
bPopFlag[2] = false;
}
__syncthreads();
if(!bPopFlag[1])
{
//Verify if lazy steal can be done.
if(iTid < NDEQUES)
{
if(popStartIdxAndSize[iTid].x != INFTY && iTid != iBid)
{
atomicMin(&uiPopDequeIdx , iTid);
bPopFlag[2] = true;
bPopFlag[3] = true;
}
}
__syncthreads();
if(iTid == uiPopDequeIdx)
{
popStartIdxAndSize[iBid] = popStartIdxAndSize[iTid];
popStartIdxAndSize[iTid].x = INFTY;
}
while(!bPopFlag[3])
{ //No more work, try to steal some!
__syncthreads();
if(iTid == 0)
{
uiActiveDequesIdx = 0;
}
__syncthreads();
if(iTid < NDEQUES)
{
if(aiDequeFlags[iTid] == 1)
{
uiActiveDequesIdx = 1;
//Debug
printf("Block %d steal attempt on block %d. Victim bottom=%d\n" , blockIdx.x , threadIdx.x , aiDequesBottoms[iTid]);
//
if(popTop(aiDequesBottoms , auiDequesAges , iTid , popStartIdxAndSize[iTid]))
{
aiDequeFlags[iBid] = 1;
atomicMin(&uiPopDequeIdx , iTid);
bPopFlag[3] = true;
//Debug
//printf("%d ss %d %d %d\n" , iBid , iTid , popStartIdxAndSize[iTid].x , popStartIdxAndSize[iTid].y);
//
}
}
}
__syncthreads();
if(uiActiveDequesIdx == 0)
{ //No more work to steal. End.
break;
}
if(iTid == uiPopDequeIdx)
{
popStartIdxAndSize[iBid] = popStartIdxAndSize[iTid];
popStartIdxAndSize[iTid].x = INFTY;
}
__syncthreads();
}
}
__syncthreads();
if(bPopFlag[3] && iTid < popStartIdxAndSize[iBid].y) //assuming number of threads >= WORK_SIZE
{
aDeques->getElement(work , uiPopDequeIdx*iDequeSize + popStartIdxAndSize[iBid].x + iTid);
}
}
return bPopFlag[3];
}
This last snippet is the generated log. The push lines ("Block X push. Bottom=Y") were generated by a push function which was not showed here. Remember that in the beginning, just block 0 has 1 work.
Block 0 pop global deque. Bottom=0
Block 4 steal attempt on block 0. Victim bottom=0
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 7 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 4 steal attempt on block 0. Victim bottom=0
Block 7 steal attempt on block 0. Victim bottom=1
Block 0 push. Bottom=448
Block 2 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 7 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 4. Victim bottom=0
Block 1 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 4. Victim bottom=0
Block 5 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 4. Victim bottom=0
Block 4 push. Bottom=384
As can be seen, only block 4 can see the changes in block 0 deque bottom pointer. I tried adding some __threadfence() calls after any change in the pointers but no sucess. Thanks for the attention!