Is it possible to generate random numbers within a device function without preallocate all the states? I would like to generate and use them in "realtime". I need to use them for Monte Carlo simulations what are the most suitable for this purpose? The number generated below are single precision is it possible to have them in double precision?
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
__global__ void cudaRand(float *d_out, unsigned long seed)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init(seed, i, 0, &state);
d_out[i] = curand_uniform(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
float *v = new float[N];
float *d_out;
cudaMalloc((void**)&d_out, N * sizeof(float));
// generate random numbers
cudaRand << < 1, N >> > (d_out, time(NULL));
cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
{
printf("out: %f\n", v[i]);
}
cudaFree(d_out);
delete[] v;
return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>
__global__ void cudaRand(double *d_out)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curandState state;
curand_init((unsigned long long)clock() + i, 0, 0, &state);
d_out[i] = curand_uniform_double(&state);
}
int main(int argc, char** argv)
{
size_t N = 1 << 4;
double *h_v = new double[N];
double *d_out;
cudaMalloc((void**)&d_out, N * sizeof(double));
// generate random numbers
cudaRand << < 1, N >> > (d_out);
cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);
for (size_t i = 0; i < N; i++)
printf("out: %f\n", h_v[i]);
cudaFree(d_out);
delete[] h_v;
return 0;
}
How I was dealing with the similar situation in the past, within
__device__
/__global__
function:So just use
curand_uniform_double
for generating random doubles and also I believe you don't want the same seed for all of the threads, thats what I am trying to achieve by usingclock() + tId
instead. This way the odds of having the samerand1
/rand2
in any of the two threads are close to nil.EDIT:
However, based on below comments, proposed approach may perhaps lead to biased result:
JackOLantern pointed me to this part of curand documentation:
Also there is a devtalk thread devoted to how to improve performance of
curand_init
in which the proposed solution to speed up the curand initialization is:But the same poster is later stating:
So it is basically up to you whether you want better performance (as I did) or 1000% unbiased results. If that is what you desire, then solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
Using not
0
value foroffset
andsubsequence
parameters is, however, decreasing performance. For more info on these parameters you may review this SO thread and also curand documentation.I see that JackOLantern stated in comment that:
I was dealing with this in my thesis on several pages, tried various approaches to get different random numbers in each thread and creating
curandState
in each of the threads turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and among others I tried:curandState
in each thread turned out to be superior,curandState
s and reusing them - this was memory heavy and when I decreased number of preallocated states then I had to use non zero values foroffset
/subsequence
parameters ofcurand_uniform_double
in order to get rid of bias which led to decreased performance when generating numbers.So after making thorough analysis I decided to indeed call
curand_init
andcurand_uniform_double
in each thread. The only problem was with the amount of registry that these states were occupying so I had to be careful with the block sizes not to exceed the max number of registry available to each block.Thats what I have to say about provided solution which I was finally able to test and it is working just fine on my machine/GPU. I run the code from UPDATE section in the above question and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing kernel to see what went wrong inside. This topic is very well covered in this SO thread.