I launch 1 block of 256 threads from my Setup() kernel to set up an array RNGstates with 256 CURAND states:
__global__ void Setup(curandState *RNGstates, long seed) {
    int tid = threadIdx.x;
    curand_init(seed, tid, 0, &RNGstates[tid]);
}
Now, I launch 1000 blocks of 256 threads from my Generate() kernel to fill the array result with 256,000 random numbers. However, I do so using only the 256 states of RNGstates, such that each state will be accessed by 1000 threads (one from each block):
__global__ void Generate(curandState *RNGstates, float *result) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    float rnd = curand_uniform(&RNGstates[threadIdx.x]);
    result[tid] = rnd;
}
I know that calling curand_uniform() updates the state somehow, so I presume some write operation is taking place. Should I be worried about data races occurring when the 1000 threads mapped to each of the 256 CURAND states try to update the state implicitly through curand_uniform()? Will this impact the quality of my random numbers (e.g. will I get frequent duplicate values)?
Many thanks.
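I think sharing states will definitely impact the quality. Duplicate values are the best case you can hope for when states are shared: curand_uniform() reads the state, advances it, and writes it back without any synchronization, so concurrent calls on the same state are a data race that could corrupt the state entirely.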
You could keep one state for each of your threads.
With 1000 blocks of 256 threads, that means 256,000 states in your case (and Setup() must then be launched with the same 1000 blocks as Generate()). The code would look like this:
__global__ void Setup(curandState *RNGstates, long seed) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    curand_init(seed, tid, 0, &RNGstates[tid]);
}
and
__global__ void Generate(curandState *RNGstates, float *result) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    float rnd = curand_uniform(&RNGstates[tid]);
    result[tid] = rnd;
}
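For completeness, here is a minimal host-side sketch of how the two kernels above might be driven with one state per thread. The CUDA_CHECK macro, the blocksFor() helper, the extra n parameter with its bounds check, and the seed value 1234 are my own additions for illustration and are not part of the code above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,  \
                    __LINE__, cudaGetErrorString(err_));            \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n threads.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// One state per thread: same seed, distinct sequence number per thread.
__global__ void Setup(curandState *RNGstates, unsigned long long seed, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) curand_init(seed, tid, 0, &RNGstates[tid]);
}

// Each thread advances only its own state, so no state is shared.
__global__ void Generate(curandState *RNGstates, float *result, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) result[tid] = curand_uniform(&RNGstates[tid]);
}

int main() {
    const int n = 256000;
    const int threads = 256;
    const int blocks = blocksFor(n, threads);  // 1000 blocks

    curandState *states = nullptr;
    float *dResult = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&states, n * sizeof(curandState)));
    CUDA_CHECK(cudaMalloc((void **)&dResult, n * sizeof(float)));
    printf("memory used by states: %zu bytes\n", n * sizeof(curandState));

    // Setup and Generate use the same launch configuration.
    Setup<<<blocks, threads>>>(states, 1234ULL, n);
    CUDA_CHECK(cudaGetLastError());
    Generate<<<blocks, threads>>>(states, dResult, n);
    CUDA_CHECK(cudaGetLastError());

    std::vector<float> hResult(n);
    CUDA_CHECK(cudaMemcpy(hResult.data(), dResult, n * sizeof(float),
                          cudaMemcpyDeviceToHost));

    // curand_uniform() returns values in (0, 1].
    int outOfRange = 0;
    for (float v : hResult) {
        if (!(v > 0.0f && v <= 1.0f)) ++outOfRange;
    }
    printf("values outside (0, 1]: %d\n", outOfRange);

    CUDA_CHECK(cudaFree(states));
    CUDA_CHECK(cudaFree(dResult));
    return outOfRange == 0 ? 0 : 1;
}
```

The printed state size shows what the one-state-per-thread approach costs in device memory.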
To reduce the memory needed for the states, you could limit the number of blocks to a small value and generate multiple random numbers per thread instead of one, for example:
__global__ void generate_uniform_kernel(curandState *state,
                                        unsigned int *result)
{
    /* Note: the hard-coded 64 assumes a launch with 64 threads per block */
    int id = threadIdx.x + blockIdx.x * 64;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int n = 0; n < 10000; n++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}
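Applied to the 256,000-number case from the question, the same idea could be sketched as follows: a small grid where each thread owns one state and fills several output slots in a grid-stride loop. The kernel name GenerateStrided, the helper valuesPerThread(), the choice of 4 blocks, and the seed are hypothetical choices for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical helper: upper bound on values written by a single thread.
int valuesPerThread(int n, int totalThreads) {
    return (n + totalThreads - 1) / totalThreads;
}

__global__ void Setup(curandState *RNGstates, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, tid, 0, &RNGstates[tid]);
}

// Each thread uses only its own state and writes result[tid],
// result[tid + stride], result[tid + 2*stride], ...
__global__ void GenerateStrided(curandState *RNGstates, float *result, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    curandState local = RNGstates[tid];  // work on a register/local copy
    for (int i = tid; i < n; i += stride) {
        result[i] = curand_uniform(&local);
    }
    RNGstates[tid] = local;  // save the advanced state for later launches
}

int main() {
    const int n = 256000;
    const int threads = 256;
    const int blocks = 4;                      // assumed small grid
    const int totalThreads = blocks * threads; // 1024 states instead of 256,000

    curandState *states = nullptr;
    float *dResult = nullptr;
    if (cudaMalloc((void **)&states, totalThreads * sizeof(curandState)) != cudaSuccess ||
        cudaMalloc((void **)&dResult, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    Setup<<<blocks, threads>>>(states, 1234ULL);
    GenerateStrided<<<blocks, threads>>>(states, dResult, n);

    std::vector<float> hResult(n);
    cudaError_t err = cudaMemcpy(hResult.data(), dResult, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("states: %d, values per thread: %d\n", totalThreads,
           valuesPerThread(n, totalThreads));
    printf("first values: %f %f %f\n", hResult[0], hResult[1], hResult[2]);

    cudaFree(states);
    cudaFree(dResult);
    return 0;
}
```

Writing with a stride equal to the total thread count keeps neighbouring threads writing to neighbouring addresses, so the stores stay coalesced.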
See the Device API Examples section of the cuRAND reference manual for complete examples of how to deal with multiple blocks.
You can also use curandStateMtgp32_t, which needs only one state per block (provided each block has at most 256 threads): http://docs.nvidia.com/cuda/curand/device-api-overview.html#bit-generation-1
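A rough sketch of what the MTGP32 variant could look like, modelled on the MTGP example in the cuRAND documentation. The kernel name GenerateMtgp, the helper totalValues(), the 200 x 256 x 5 layout, and the seed are assumptions made for illustration. As far as I know, the pre-generated parameter header provides 200 parameter sets, which is why the sketch uses 200 blocks and lets every thread produce 5 values. Note that every thread of a block has to call the generator the same number of times, so the loop count must not depend on the thread index.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>       // MTGP32 host helper functions
#include <curand_mtgp32dc_p_11213.h>  // pre-computed MTGP32 parameter sets

// Hypothetical helper: how many values the launch below produces.
int totalValues(int blocks, int threads, int perThread) {
    return blocks * threads * perThread;
}

// One MTGP32 state per block, shared by all (at most 256) threads of it.
// All threads of the block execute the same number of loop iterations.
__global__ void GenerateMtgp(curandStateMtgp32 *states, float *result,
                             int perThread) {
    for (int i = 0; i < perThread; ++i) {
        float rnd = curand_uniform(&states[blockIdx.x]);
        int idx = i * gridDim.x * blockDim.x
                + blockIdx.x * blockDim.x + threadIdx.x;
        result[idx] = rnd;
    }
}

int main() {
    const int blocks = 200, threads = 256, perThread = 5;
    const int n = totalValues(blocks, threads, perThread);  // 256,000

    curandStateMtgp32 *states = nullptr;
    mtgp32_kernel_params *kernelParams = nullptr;
    float *dResult = nullptr;
    if (cudaMalloc((void **)&states, blocks * sizeof(curandStateMtgp32)) != cudaSuccess ||
        cudaMalloc((void **)&kernelParams, sizeof(mtgp32_kernel_params)) != cudaSuccess ||
        cudaMalloc((void **)&dResult, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    // Set up the constant parameters and one generator state per block.
    if (curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, kernelParams)
            != CURAND_STATUS_SUCCESS ||
        curandMakeMTGP32KernelState(states, mtgp32dc_params_fast_11213,
                                    kernelParams, blocks, 1234)
            != CURAND_STATUS_SUCCESS) {
        fprintf(stderr, "MTGP32 setup failed\n");
        return EXIT_FAILURE;
    }

    GenerateMtgp<<<blocks, threads>>>(states, dResult, perThread);

    std::vector<float> hResult(n);
    cudaError_t err = cudaMemcpy(hResult.data(), dResult, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    printf("first values: %f %f %f\n", hResult[0], hResult[1], hResult[2]);

    cudaFree(states);
    cudaFree(kernelParams);
    cudaFree(dResult);
    return 0;
}
```

Because the host helper functions live in the cuRAND library, this would need to be linked with -lcurand.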