Minimize the number of cuRAND states stored during the simulation

Published 2019-06-12 21:49

Question:

I am currently writing a Monte-Carlo simulation in CUDA. As such, I need to generate lots of random numbers on the fly using the cuRAND library. Each thread processes one element in a huge float array (omitted in the example) and generates 1 or 2 random numbers per kernel invocation.

The usual approach (see example below) seems to be to allocate one state per thread. The states are reused in subsequent kernel invocations. However, this does not scale well when the number of threads increases (up to 10⁸ in my application), since it becomes the dominant memory (and bandwidth) usage in the program.

I know that one possible approach would be to process multiple array elements per thread in a grid stride loop fashion. Here I want to investigate a different method.
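For concreteness, the grid-stride variant I am setting aside would look roughly like this (just a sketch; `N` and the per-element work are placeholders):

```cuda
// Sketch: a fixed-size grid strides over N elements, so the number of
// cuRAND states equals the (small) launch size rather than N.
__global__ void kernel_strided(curandState * states, float * data, size_t N) {
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = (size_t)gridDim.x * blockDim.x;
  curandState local = states[tid];      // copy state to registers once
  for (size_t i = tid; i < N; i += stride) {
    data[i] = curand_uniform(&local);   // placeholder per-element work
  }
  states[tid] = local;                  // persist for the next invocation
}
```

Copying the state into a register-local variable and writing it back once avoids a global load/store per random number.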

I am aware that for the compute capability I am targeting (3.5), the maximum number of resident threads per SM is 2048, i.e. 2 blocks in the example below.
Is it possible to use only 2048 states per multiprocessor, regardless of the total number of threads? All random numbers generated should still be statistically independent.

I think this might be done if each resident thread were assigned a unique index modulo 2048, which could then be used to fetch the state from an array. Does such an index exist?

More generally, are there other ways to reduce the memory footprint of the RNG states ?

Example code (one state per thread)

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>

#define gridSize 100
#define blockSize 1024
#define iter 100

__global__ void rng_init(unsigned long long seed, curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  const float x = curand_uniform(&states[Idx]);
  // Do stuff with x ...
}

int main() {
  curandState * states;
  assert(cudaMalloc(&states, gridSize*blockSize*sizeof(curandState)) == cudaSuccess);
  rng_init<<<gridSize,blockSize>>>(clock(), states);
  assert(cudaDeviceSynchronize() == cudaSuccess);

  for (size_t i = 0 ; i < iter ; ++i) {
    kernel<<<gridSize,blockSize>>>(states);
    assert(cudaDeviceSynchronize() == cudaSuccess);
  }
  return 0;
}

Answer 1:

In short, in your question you mention using a grid-stride loop as a way to collapse the required state. I think this method, or one like it (as suggested by @talonmies), is the most sensible approach. Choose a method that reduces the thread count to the minimum necessary to keep the machine fully utilized, then have each thread compute multiple operations, re-using the provided random generator state.

I started with your code shell and turned it into a classical "hello world" MC problem: compute pi based on the ratio of the area of the square to the area of the inscribed circle, randomly generating points to estimate area.

Then we will consider 3 methods:

  1. Create a large 1D grid, as well as state for each thread, so that each thread computes one random point and tests it.

  2. Create a much smaller 1D grid, as well as state for each thread, and allow each thread to compute multiple random points/tests, so that the same number of points/tests are generated as in case 1.

  3. Create a grid the same size as method 2, but also create a "unique resident thread index", and then provide enough state to cover just the unique resident threads. Each thread will effectively use the state provided using the "resident thread index", rather than an ordinary globally unique thread index. Compute the same number of points/tests as in case 1.

The computation of the "unique resident thread index" is not a trivial matter. In host code:

  1. We must determine the maximum number of blocks that are theoretically possible to be resident. I have used a simple heuristic for this, but there are arguably better methods. I simply divide the maximum resident threads per multiprocessor by the number of chosen threads per block. We must use integer division for this.

  2. Then initialize enough state to cover the maximum number of blocks times the number of SMs on the GPU.

In device code:

  1. Determine which SM I am on.
  2. Using that SM, perform an atomic test to retrieve a unique block number per SM.
  3. Combine the results of 1, and 2, plus the threadIdx.x variable to create a "unique resident thread index". This then becomes our typical thread index instead of the usual calculation.

From a results perspective, the following observations can be made:

  1. Contrary to what is stated in the other answer, the init time is not insignificant. It should not be ignored. For this simple problem, the init kernel run-time outweighs the calculation kernel run time. Therefore our most important takeaway is that we should seek methods to minimize the creation of random generator state. We certainly do not want to needlessly re-run the initialization. Therefore we should probably discard method 1 based on these results, for this particular code/test.

  2. From a calculation kernel runtime perspective, there is little to commend one kernel over another.

  3. From a code complexity perspective, method 2 is clearly less complex than method 3, while giving about the same performance.

For this particular test case, then, method 2 seems to be the winner, exactly as predicted by @talonmies.

What follows is a worked example of the 3 methods. I'm not claiming that this is defect free or instructive for every case, code, or scenario. There are a lot of moving parts here, but I believe the 3 conclusions above are valid for this case.

$ cat t1201.cu
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>
#include <iostream>
#include <stdlib.h>

#define blockSize 1024

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

#define MAX_SM 64

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


__device__ unsigned long long count = 0;
__device__ unsigned int blk_ids[MAX_SM] = {0};

__global__ void rng_init(unsigned long long seed, curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states, int length) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = 0; i < length; i++){
    const float x = curand_uniform(&states[Idx]);
    const float y = curand_uniform(&states[Idx]);
    if (x*x + y*y < 1.0f)  // equivalent test on the squared distance, no sqrtf needed
      atomicAdd(&count, 1ULL);}
}

static __device__ __inline__ int __mysmid(){
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}

// Combine the SM id, the per-SM block slot, and threadIdx.x into a unique
// resident thread index. MAX_BLOCKS_PER_SM matches the host heuristic:
// 2048 resident threads per SM / 1024 threads per block = 2 on this target.
#define MAX_BLOCKS_PER_SM 2
__device__ int get_my_resident_thread_id(int sm_blk_id){
  return (__mysmid()*MAX_BLOCKS_PER_SM + sm_blk_id)*blockDim.x + threadIdx.x;
}

__device__ int get_block_id(){
  int my_sm = __mysmid();
  int my_block_id = -1;
  bool done = false;
  int i = 0;
  while ((!done)&&(i<32)){
    unsigned int block_flag = 1<<i;
    if ((atomicOr(blk_ids+my_sm, block_flag)&block_flag) == 0){my_block_id = i; done = true;}
    i++;}
  return my_block_id;
}

__device__ void release_block_id(int block_id){
  unsigned int block_mask = ~(1<<block_id);
  int my_sm = __mysmid();
  atomicAnd(blk_ids+my_sm, block_mask);
}

__global__ void kernel2(curandState * states, int length) {

  __shared__ volatile int my_block_id;
  if (!threadIdx.x) my_block_id = get_block_id();
  __syncthreads();
  const size_t Idx = get_my_resident_thread_id(my_block_id);
  for (int i = 0; i < length; i++){
    const float x = curand_uniform(&states[Idx]);
    const float y = curand_uniform(&states[Idx]);
    if (x*x + y*y < 1.0f)  // equivalent test on the squared distance, no sqrtf needed
      atomicAdd(&count, 1ULL);}
  __syncthreads();
  if (!threadIdx.x) release_block_id(my_block_id);
  __syncthreads();
}



int main(int argc, char *argv[]) {
  int gridSize = 10;
  if (argc > 1) gridSize = atoi(argv[1]);
  curandState * states;
  assert(cudaMalloc(&states, gridSize*gridSize*blockSize*sizeof(curandState)) == cudaSuccess);
  unsigned long long hcount;
  //warm - up
  rng_init<<<gridSize*gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  //method 1: 1 curand state per point
  std::cout << "Method 1 init blocks: " << gridSize*gridSize << std::endl;
  unsigned long long dtime = dtime_usec(0);
  rng_init<<<gridSize*gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  unsigned long long initt = dtime_usec(dtime);
  kernel<<<gridSize*gridSize,blockSize>>>(states, 1);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 1 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  //method 2: 1 curand state per gridSize points
  std::cout << "Method 2 init blocks: " << gridSize << std::endl;
  dtime = dtime_usec(0);
  rng_init<<<gridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  initt = dtime_usec(dtime);
  kernel<<<gridSize,blockSize>>>(states, gridSize);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 2 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  //method 3: 1 curand state per resident thread
  // compute the maximum number of state entries needed
  int num_sms;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
  int max_sm_threads;
  cudaDeviceGetAttribute(&max_sm_threads, cudaDevAttrMaxThreadsPerMultiProcessor, 0);
  int max_blocks = max_sm_threads/blockSize;
  int total_state = max_blocks*num_sms*blockSize;
  int rgridSize = (total_state + blockSize-1)/blockSize;
  std::cout << "Method 3 sms: " << num_sms << " init blocks: " << rgridSize << std::endl;
  // run test
  dtime = dtime_usec(0);
  rng_init<<<rgridSize,blockSize>>>(1234ULL, states);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  initt = dtime_usec(dtime);
  kernel2<<<gridSize,blockSize>>>(states, gridSize);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  dtime = dtime_usec(dtime) - initt;
  cudaMemcpyFromSymbol(&hcount, count, sizeof(unsigned long long));
  std::cout << "method 3 elapsed time: " << dtime/(float)USECPSEC << " init time: " << initt/(float)USECPSEC << " pi: " << 4.0f*hcount/(float)(gridSize*gridSize*blockSize) << std::endl;
  hcount = 0;
  cudaMemcpyToSymbol(count, &hcount, sizeof(unsigned long long));
  return 0;
}

$ nvcc -arch=sm_35 -O3 -o t1201 t1201.cu
$ ./t1201 28
Method 1 init blocks: 784
method 1 elapsed time: 0.001218 init time: 3.91075 pi: 3.14019
Method 2 init blocks: 28
method 2 elapsed time: 0.00117 init time: 0.003629 pi: 3.14013
Method 3 sms: 14 init blocks: 28
method 3 elapsed time: 0.001193 init time: 0.003622 pi: 3.1407
$


Answer 2:

Depending on the RNG type you are using, several approaches are possible. In your case, if you have some freedom in the choice of generator, and if your device has good double-precision performance, you may want to eliminate the notion of stored RNG state entirely and rely on initialization and skip-ahead techniques. The only elements then needed are the seed and an index, which can be computed from the iteration and simulation id.

See the Skip Ahead section of the cuRAND documentation, and note that most curand_init overloads accept an offset parameter. In some cases, given the nature of the RNG state structure and the small cost of initialization, it may be better to call curand_init with the appropriate offset on a state structure residing in register space than to load and store the state structure from global memory for every random value extracted.
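A stateless sketch along these lines, using the counter-based Philox generator (whose initialization is cheap), might look as follows; seed, thread index, and iteration number fully determine the draws, so no state array is stored between launches (the factor 2 in the offset is an assumption that each iteration consumes two values per thread):

```cuda
// Sketch: re-create the generator state in registers on every call.
// The subsequence comes from the thread index and the offset from the
// iteration number, so draws are independent across threads and
// iterations without any state kept in global memory.
__global__ void kernel_stateless(unsigned long long seed,
                                 unsigned long long iteration) {
  const unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state;               // lives in registers/local
  curand_init(seed, tid, iteration * 2, &state);  // skip ahead 2 draws/iteration
  const float x = curand_uniform(&state);
  const float y = curand_uniform(&state);
  // Do stuff with x and y ...
}
```

The per-call curand_init cost is much lower for Philox than for the default XORWOW generator, which is what makes this trade-off attractive.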