I am currently writing a Monte-Carlo simulation in CUDA. As such, I need to generate lots of random numbers on the fly using the cuRAND library. Each thread processes one element in a huge float array (omitted in the example) and generates 1 or 2 random numbers per kernel invocation.
The usual approach (see example below) seems to be to allocate one state per thread. The states are reused in subsequent kernel invocations. However, this does not scale well when the number of threads increases (up to 10⁸ in my application), since it becomes the dominant memory (and bandwidth) usage in the program.
I know that one possible approach would be to process multiple array elements per thread in a grid-stride loop. Here, however, I want to investigate a different method.
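(For reference, what I mean by the grid-stride variant is roughly the sketch below, with illustrative names: a fixed-size grid where each thread loads its state once, covers several array elements, and writes the state back.)

#include <curand_kernel.h>

// Grid-stride variant: a fixed-size grid, one state per launched thread,
// each thread processing several array elements with the same state.
__global__ void kernel_strided(curandState * states, float * data, size_t n) {
    const size_t tid    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    curandState local = states[tid];          // load the state once
    for (size_t i = tid; i < n; i += stride)
        data[i] = curand_uniform(&local);     // one random number per element
    states[tid] = local;                      // store the state back once
}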
I am aware that for the compute capability I am targeting (3.5), the maximum number of resident threads per SM is 2048, i.e. 2 blocks in the example below.
Is it possible to use only 2048 states per multiprocessor, regardless of the total number of threads?
All random numbers generated should still be statistically independent.
I think this might be done if each resident thread were assigned a unique index modulo 2048, which could then be used to fetch the state from an array. Does such an index exist?
More generally, are there other ways to reduce the memory footprint of the RNG states ?
Example code (one state per thread)
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>
#include <time.h>   // for clock()

#define gridSize 100
#define blockSize 1024
#define iter 100

__global__ void rng_init(unsigned long long seed, curandState * states) {
    const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states) {
    const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float x = curand_uniform(&states[Idx]);
    // Do stuff with x ...
}

int main() {
    curandState * states;
    assert(cudaMalloc(&states, gridSize*blockSize*sizeof(curandState)) == cudaSuccess);

    rng_init<<<gridSize,blockSize>>>(clock(), states);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    for (size_t i = 0 ; i < iter ; ++i) {
        kernel<<<gridSize,blockSize>>>(states);
        assert(cudaDeviceSynchronize() == cudaSuccess);
    }

    return 0;
}
Depending on the RNG type you are using, several approaches are possible. In your case, if you have a bit of freedom in the choice of RNG type, and if you have a device with good double-precision performance, you may want to completely remove the notion of stored RNG state and rely on init and skip-ahead techniques. The only elements needed are then the seed and an index, which can be computed from the iteration and simulation id.
See the Skip Ahead section of the cuRAND documentation, and note that most curand_init overloads accept an offset parameter. In some cases, given the nature of the RNG state structure and the small cost of init, it might be better to call curand_init with the appropriate offset on a state data structure residing in register space than to load/store the state data structure from global memory at each random value extraction.
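A sketch of that idea (illustrative names and offset scheme, assuming the default XORWOW generator): the state lives only in a local variable, and every launch calls curand_init with the thread's global index as the subsequence and an offset derived from the iteration number, so no stretch of the sequence is ever reused.

#include <curand_kernel.h>

// Stateless variant: no curandState array in global memory. Each launch,
// every thread re-initializes a local state; the global thread id selects the
// subsequence and the iteration number advances the offset, so the values
// drawn never overlap between threads or between kernel invocations.
__global__ void kernel_stateless(unsigned long long seed,
                                 unsigned long long iteration,
                                 unsigned long long samplesPerIteration) {
    const unsigned long long idx =
        blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    curandState state;                        // lives in registers / local memory
    curand_init(seed, idx, iteration * samplesPerIteration, &state);
    const float x = curand_uniform(&state);
    // Do stuff with x ...
    (void)x;
}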
In short, in your question you mention using a grid-striding loop as a way to collapse the required state. I think this method, or one like it (as suggested by @talonmies), is the most sensible approach. Choose a method that reduces the thread count to the minimum necessary to keep the machine busy/fully utilized. Then have each thread compute multiple operations, re-using the provided random generator state.
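One possible way (an assumption, not the only one) to pick that minimum thread count is to ask the occupancy API how many blocks of the calculation kernel can be resident per SM; the names below are illustrative and the kernel body is a stand-in.

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <stdio.h>

__global__ void kernel(curandState * states) {
    // calculation kernel body omitted; only its resource usage matters here
}

// Sketch: size the grid to "just enough" blocks to fill the device, so that
// each thread can then loop over many samples.
int main() {
    const int blockSize = 1024;
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of this kernel can be resident per SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);
    const int gridSize = blocksPerSM * numSMs;   // minimal grid that keeps the GPU busy
    printf("launching %d blocks of %d threads\n", gridSize, blockSize);
    return 0;
}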
I started with your code shell and turned it into a classical "hello world" MC problem: compute pi based on the ratio of the area of the square to the area of the inscribed circle, randomly generating points to estimate area.
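For concreteness, the per-point test each thread performs looks roughly like this (a sketch; the name trial and the reduction of the hit counts over all trials are left out):

#include <curand_kernel.h>

// One Monte-Carlo trial: draw a point uniformly in the unit square and test
// whether it lies inside the quarter circle of radius 1. Summing the results
// over many trials and dividing by the trial count estimates pi/4.
__device__ unsigned int trial(curandState * state) {
    const float x = curand_uniform(state);
    const float y = curand_uniform(state);
    return (x * x + y * y <= 1.0f) ? 1u : 0u;
}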
Then we will consider 3 methods:
1. Create a large 1D grid, as well as state for each thread, so that each thread computes one random point and tests it.
2. Create a much smaller 1D grid, as well as state for each thread, and allow each thread to compute multiple random points/tests, so that the same number of points/tests is generated as in case 1.
3. Create a grid the same size as in method 2, but also create a "unique resident thread index", and provide only enough state to cover the unique resident threads. Each thread will use the state selected by the "resident thread index" rather than an ordinary globally unique thread index. Compute the same number of points/tests as in case 1.
The computation of the "unique resident thread index" is not a trivial matter. In host code:
We must determine the maximum number of blocks that can theoretically be resident per SM. I have used a simple heuristic for this, although there are arguably better methods: simply divide the maximum resident threads per multiprocessor by the chosen number of threads per block, using integer division.
Then initialize enough state to cover that maximum number of blocks times the number of SMs on the GPU (a host-side sketch of these two steps follows below).
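A minimal host-side sketch of those two steps, assuming the 1024-thread blocks from the question (names are illustrative):

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <stdio.h>

int main() {
    const int blockSize = 1024;                  // threads per block, as in the question

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Heuristic: maximum resident threads per SM divided by the chosen block
    // size (integer division) gives the maximum resident blocks per SM.
    const int maxBlocksPerSM = prop.maxThreadsPerMultiProcessor / blockSize;

    // Enough state to cover every thread that can be resident at one time.
    const int residentThreads = maxBlocksPerSM * prop.multiProcessorCount * blockSize;

    curandState * states;
    cudaMalloc(&states, residentThreads * sizeof(curandState));
    printf("allocating %d RNG states instead of one per launched thread\n", residentThreads);

    // rng_init would then be launched to initialize exactly residentThreads states.
    cudaFree(states);
    return 0;
}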
In device code:
Determine which SM the block is resident on and which of the possible resident-block slots on that SM it occupies, and combine these with the threadIdx.x variable to create a "unique resident thread index". This then becomes our typical thread index instead of the usual globally unique calculation.
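The text above does not spell out the exact mechanism, so what follows is only one possible sketch, not necessarily what the worked example below does: each block reads its SM id via the %smid special register and claims a free per-SM "slot" with atomics, releasing it when done. The names kernel_resident and slot_busy and the values MAX_SM and MAX_BLOCKS_PER_SM are illustrative; since %smid values are not guaranteed to be contiguous, the table is sized by an upper bound on the SM id rather than by the exact SM count.

#include <curand_kernel.h>

// Illustrative sizing: an upper bound on the SM id and the maximum number of
// resident blocks per SM for 1024-thread blocks on CC 3.5. The states array
// must then hold MAX_SM * MAX_BLOCKS_PER_SM * blockDim.x entries.
#define MAX_SM            64
#define MAX_BLOCKS_PER_SM 2

// Per-(SM, slot) occupancy flags; __device__ globals are zero-initialized.
__device__ unsigned int slot_busy[MAX_SM][MAX_BLOCKS_PER_SM];

// Read the %smid special register: the id of the SM this block is running on.
__device__ unsigned int get_smid(void) {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void kernel_resident(curandState * states) {
    __shared__ unsigned int block_slot;
    const unsigned int sm = get_smid();

    // Thread 0 claims a slot that no block currently resident on this SM is
    // using. At most MAX_BLOCKS_PER_SM blocks can be resident at once, so a
    // free slot exists and the loop terminates.
    if (threadIdx.x == 0) {
        unsigned int s = 0;
        while (atomicCAS(&slot_busy[sm][s], 0u, 1u) != 0u)
            s = (s + 1) % MAX_BLOCKS_PER_SM;
        block_slot = s;
    }
    __syncthreads();

    // "Unique resident thread index": unique among all threads that can be
    // executing concurrently, regardless of how large the grid is.
    const unsigned int ridx =
        (sm * MAX_BLOCKS_PER_SM + block_slot) * blockDim.x + threadIdx.x;

    const float x = curand_uniform(&states[ridx]);
    // Do stuff with x ...
    (void)x;

    // Make the updated RNG state visible to other blocks, then release the
    // slot so the next block scheduled on this SM can reuse state ridx.
    __syncthreads();
    __threadfence();
    if (threadIdx.x == 0)
        atomicExch(&slot_busy[sm][block_slot], 0u);
}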
From a results perspective, the following observations can be made:

Contrary to what is stated in the other answer, the init time is not insignificant. It should not be ignored. For this simple problem, the init kernel run time outweighs the calculation kernel run time. Therefore our most important takeaway is that we should seek methods to minimize the creation of random generator state. We certainly do not want to needlessly re-run the initialization. Therefore we should probably discard method 1, based on these results, for this particular code/test.
From a calculation kernel runtime perspective, there is little to commend one kernel over another.
From a code complexity perspective, method 2 is clearly less complex than method 3, while giving about the same performance.
For this particular test case, then, method 2 seems to be the winner, exactly as predicted by @talonmies.
What follows is a worked example of the 3 methods. I'm not claiming that this is defect free or instructive for every case, code, or scenario. There are a lot of moving parts here, but I believe the 3 conclusions above are valid for this case.