I have a function which takes a color picture and returns the gray version of it. If I run the sequential code on the host, everything works perfectly. If I run it on the device, the result is slightly different (roughly one pixel in a thousand is either +1 or -1 compared to the correct value).
I think this has something to do with the conversions, but I don't know for sure. This is the code I use:
__global__ void rgb2gray_d(unsigned char *deviceImage, unsigned char *deviceResult, const int height, const int width) {
    /* Calculate the global thread id; the block and grid may be 2D,
       so flatten each before combining them. */
    int threadsPerBlock  = blockDim.x * blockDim.y;
    int threadNumInBlock = threadIdx.x + blockDim.x * threadIdx.y;
    int blockNumInGrid   = blockIdx.x + gridDim.x * blockIdx.y;
    int i = blockNumInGrid * threadsPerBlock + threadNumInBlock;

    /* Guard against out-of-range threads when the pixel count is not a
       multiple of the block size. */
    if (i >= width * height)
        return;

    /* The image is planar: R, G and B occupy three consecutive
       width * height byte planes. */
    float r = static_cast<float>(deviceImage[i]);
    float g = static_cast<float>(deviceImage[(width * height) + i]);
    float b = static_cast<float>(deviceImage[(2 * width * height) + i]);
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
    deviceResult[i] = static_cast<unsigned char>(grayPix);
}
void rgb2gray(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height, NSTimer &timer) {
    unsigned char *deviceImage;
    unsigned char *deviceResult;

    /* The input holds three planes (R, G, B); the output is one gray plane. */
    int initialBytes = width * height * 3;
    int endBytes = width * height * sizeof(unsigned char);
    /* std::vector (needs <vector>) avoids a large variable-length array on the stack. */
    std::vector<unsigned char> grayImageSeq(endBytes);

    cudaMalloc((void**) &deviceImage, initialBytes);
    cudaMalloc((void**) &deviceResult, endBytes);
    cudaMemset(deviceResult, 0, endBytes);
    cudaMemset(deviceImage, 0, initialBytes);

    cudaError_t err = cudaMemcpy(deviceImage, inputImage, initialBytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        cout << "cudaMemcpy failed: " << cudaGetErrorString(err) << endl;

    /* Convert the input image to grayscale. The block count is rounded up so
       images whose pixel count is not a multiple of 256 are fully covered;
       the kernel's bounds check discards the excess threads. */
    rgb2gray_d<<<(width * height + 255) / 256, 256>>>(deviceImage, deviceResult, height, width);
    cudaDeviceSynchronize();
    cudaMemcpy(grayImage, deviceResult, endBytes, cudaMemcpyDeviceToHost);

    ////// Sequential
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float r = static_cast<float>(inputImage[(y * width) + x]);
            float g = static_cast<float>(inputImage[(width * height) + (y * width) + x]);
            float b = static_cast<float>(inputImage[(2 * width * height) + (y * width) + x]);
            float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
            grayImageSeq[(y * width) + x] = static_cast<unsigned char>(grayPix);
        }
    }

    // Compare the sequential and CUDA results and print the pixels that differ
    for (int i = 0; i < endBytes; i++) {
        if (grayImage[i] != grayImageSeq[i])
            cout << i << "-" << static_cast<unsigned int>(grayImage[i])
                 << " should be " << static_cast<unsigned int>(grayImageSeq[i]) << endl;
    }

    cudaFree(deviceImage);
    cudaFree(deviceResult);
}
I mention that I allocate width * height * 3 bytes for the initial image because it is a CImg, and CImg stores the color channels as separate planes; that is why the R, G and B values sit width * height bytes apart in the code above.
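For reference, a planar image is addressed like this (plane_at is a hypothetical helper for illustration, not part of the CImg API):

// Planar layout: all R bytes first, then all G bytes, then all B bytes.
inline unsigned char plane_at(const unsigned char *img, int width, int height,
                              int x, int y, int channel) {
    return img[(channel * width * height) + (y * width) + x];
}

This is exactly the indexing used in both the kernel and the sequential loop above.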
I work on a GeForce GTX 480.
Finally I found the answer. CUDA automatically generates fused multiply-add (FMA) instructions in both single and double precision; an FMA rounds only once, while a separate multiply and add round twice, which explains the occasional off-by-one pixels. Using the document below [1], Section 4.4, I managed to fix it: instead of computing grayPix with plain multiplies and adds, I now build it from the __fmul_rn and __fadd_rn intrinsics (see the sketch below). This disables the merging of multiplies and adds into fused multiply-add instructions.
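A sketch of the change; the intrinsics are the ones Section 4.4 describes, but the exact grouping of the operations is my reconstruction rather than the original edit:

// Before: the compiler is free to contract the multiplies and adds into FMAs.
grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);

// After: results of __fmul_rn/__fadd_rn are never contracted into FMAs,
// so the device rounds after every multiply and add, just like the host.
grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)),
                    __fmul_rn(0.11f, b));

The same document also covers the nvcc option -fmad=false, which disables the contraction for the whole compilation unit instead of a single expression, at some cost in performance.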
[1] Floating Point and IEEE 754 Compliance for NVIDIA GPUs
Floating point math can produce slightly different results in device code versus host code.
There are multiple possible reasons for this. Keep in mind that these two functions are compiled by two different compilers into two different binary programs, running on two different floating point hardware implementations.
For example, if the floating point operations are executed in a different order, rounding errors can accumulate differently and produce different results.
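A minimal host-side illustration of that order dependence (the values here are just a convenient example, not taken from the question):

#include <cstdio>

int main() {
    // Floating point addition is not associative: every intermediate
    // result is rounded, so the grouping decides what gets lost.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c); // prints 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c)); // prints 0.0: c is absorbed
    return 0;
}

Same inputs, same hardware, different order, different result.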
Also, when 32-bit (float) or 64-bit (double) floating point calculations run on an x86 CPU, the math may be done by the x87 FPU, which internally uses 80-bit precision; the result is only rounded back to 32 bits for float or 64 bits for double when it is stored to memory.
The ALUs of a GPU use 32-bit precision for floating point math (assuming you are using the float data type).
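The FMA effect from the first answer can even be reproduced without a GPU: fmaf from <cmath> computes a * b + c with a single rounding, which is exactly what the device's FMA instruction does. The sketch below scans all 24-bit RGB triples for one where the fused and unfused evaluations of the question's formula truncate to gray values one apart; the grouping of the fused version is an assumption, since the compiler may contract the expression differently:

#include <cmath>
#include <cstdio>

int main() {
    for (int r = 0; r < 256; ++r)
    for (int g = 0; g < 256; ++g)
    for (int b = 0; b < 256; ++b) {
        // Host-style evaluation: every multiply and add rounds separately.
        float plain = (0.3f * r) + (0.59f * g) + (0.11f * b);
        // Device-style evaluation: fmaf rounds each multiply-add once.
        float fused = fmaf(0.11f, (float)b,
                           fmaf(0.59f, (float)g, 0.3f * (float)r));
        unsigned char p = (unsigned char)plain;
        unsigned char f = (unsigned char)fused;
        if (p != f) {
            printf("r=%d g=%d b=%d: plain=%d fused=%d\n", r, g, b, p, f);
            return 0;
        }
    }
    printf("no mismatch found\n");
    return 0;
}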
An excellent article discussing floating point representations and arithmetic can be found here.