I have a function which takes a color picture and returns the gray version of it. If I run the sequential code on the host, everything works perfectly. If I run it on the device, the result is slightly different (roughly one pixel in a thousand is either +1 or -1 compared to the correct value).
I think this has something to do with the conversions, but I don't know for sure. This is the code I use:
__global__ void rgb2gray_d(unsigned char *deviceImage, unsigned char *deviceResult, const int height, const int width) {
    /* Calculate the global thread id; the block and grid may be 2D,
       so flatten each before combining them. */
    int threadsPerBlock  = blockDim.x * blockDim.y;
    int threadNumInBlock = threadIdx.x + blockDim.x * threadIdx.y;
    int blockNumInGrid   = blockIdx.x + gridDim.x * blockIdx.y;
    int i = blockNumInGrid * threadsPerBlock + threadNumInBlock;

    /* Guard against out-of-range threads when the pixel count is not a
       multiple of the block size. */
    if (i >= width * height)
        return;

    /* The image is planar: R, G and B occupy three consecutive
       width * height byte planes. */
    float r = static_cast<float>(deviceImage[i]);
    float g = static_cast<float>(deviceImage[(width * height) + i]);
    float b = static_cast<float>(deviceImage[(2 * width * height) + i]);
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
    deviceResult[i] = static_cast<unsigned char>(grayPix);
}
void rgb2gray(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height, NSTimer &timer) {
    unsigned char *deviceImage;
    unsigned char *deviceResult;

    /* The input holds three planes (R, G, B); the output is one gray plane. */
    int initialBytes = width * height * 3;
    int endBytes = width * height * sizeof(unsigned char);
    /* std::vector (needs <vector>) avoids a large variable-length array on the stack. */
    std::vector<unsigned char> grayImageSeq(endBytes);

    cudaMalloc((void**) &deviceImage, initialBytes);
    cudaMalloc((void**) &deviceResult, endBytes);
    cudaMemset(deviceResult, 0, endBytes);
    cudaMemset(deviceImage, 0, initialBytes);

    cudaError_t err = cudaMemcpy(deviceImage, inputImage, initialBytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        cout << "cudaMemcpy failed: " << cudaGetErrorString(err) << endl;

    /* Convert the input image to grayscale. The block count is rounded up so
       images whose pixel count is not a multiple of 256 are fully covered;
       the kernel's bounds check discards the excess threads. */
    rgb2gray_d<<<(width * height + 255) / 256, 256>>>(deviceImage, deviceResult, height, width);
    cudaDeviceSynchronize();
    cudaMemcpy(grayImage, deviceResult, endBytes, cudaMemcpyDeviceToHost);

    ////// Sequential
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float r = static_cast<float>(inputImage[(y * width) + x]);
            float g = static_cast<float>(inputImage[(width * height) + (y * width) + x]);
            float b = static_cast<float>(inputImage[(2 * width * height) + (y * width) + x]);
            float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
            grayImageSeq[(y * width) + x] = static_cast<unsigned char>(grayPix);
        }
    }

    // Compare the sequential and CUDA results and print the pixels that differ
    for (int i = 0; i < endBytes; i++) {
        if (grayImage[i] != grayImageSeq[i])
            cout << i << "-" << static_cast<unsigned int>(grayImage[i])
                 << " should be " << static_cast<unsigned int>(grayImageSeq[i]) << endl;
    }

    cudaFree(deviceImage);
    cudaFree(deviceResult);
}
I mention that I allocate width * height * 3 bytes for the initial image because it is a CImg, and CImg stores the color channels as separate planes; that is why the R, G and B values sit width * height bytes apart in the code above.
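For reference, a planar image is addressed like this (plane_at is a hypothetical helper for illustration, not part of the CImg API):

// Planar layout: all R bytes first, then all G bytes, then all B bytes.
inline unsigned char plane_at(const unsigned char *img, int width, int height,
                              int x, int y, int channel) {
    return img[(channel * width * height) + (y * width) + x];
}

This is exactly the indexing used in both the kernel and the sequential loop above.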
I work on a GeForce GTX 480.
Finally I found the answer. CUDA automatically generates fused multiply-add (FMA) instructions in both single and double precision; an FMA rounds only once, while a separate multiply and add round twice, which explains the occasional off-by-one pixels. Using the document below [1], Section 4.4, I managed to fix it: instead of computing grayPix with plain multiplies and adds, I now build it from the __fmul_rn and __fadd_rn intrinsics (see the sketch below). This disables the merging of multiplies and adds into fused multiply-add instructions.
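A sketch of the change; the intrinsics are the ones Section 4.4 describes, but the exact grouping of the operations is my reconstruction rather than the original edit:

// Before: the compiler is free to contract the multiplies and adds into FMAs.
grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);

// After: results of __fmul_rn/__fadd_rn are never contracted into FMAs,
// so the device rounds after every multiply and add, just like the host.
grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)),
                    __fmul_rn(0.11f, b));

The same document also covers the nvcc option -fmad=false, which disables the contraction for the whole compilation unit instead of a single expression, at some cost in performance.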
[1] Floating Point and IEEE 754 Compliance for NVIDIA GPUs
Floating point math can produce slightly different results in device code versus host code.
There are multiple possible reasons for this. Keep in mind that these two functions are compiled by two different compilers into two different binary programs, running on two different floating point hardware implementations.
For example, if the floating point operations are executed in a different order, rounding errors can accumulate differently and produce different results.
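A minimal host-side illustration of that order dependence (the values here are just a convenient example, not taken from the question):

#include <cstdio>

int main() {
    // Floating point addition is not associative: every intermediate
    // result is rounded, so the grouping decides what gets lost.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c); // prints 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c)); // prints 0.0: c is absorbed
    return 0;
}

Same inputs, same hardware, different order, different result.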
Also, when 32-bit (float) or 64-bit (double) floating point calculations run on an x86 CPU, the math may be done by the x87 FPU, which internally uses 80-bit precision; the result is only rounded back to 32 bits for float or 64 bits for double when it is stored to memory.
The ALUs of a GPU use 32-bit precision for floating point math (assuming you are using the float data type).
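The FMA effect from the first answer can even be reproduced without a GPU: fmaf from <cmath> computes a * b + c with a single rounding, which is exactly what the device's FMA instruction does. The sketch below scans all 24-bit RGB triples for one where the fused and unfused evaluations of the question's formula truncate to gray values one apart; the grouping of the fused version is an assumption, since the compiler may contract the expression differently:

#include <cmath>
#include <cstdio>

int main() {
    for (int r = 0; r < 256; ++r)
    for (int g = 0; g < 256; ++g)
    for (int b = 0; b < 256; ++b) {
        // Host-style evaluation: every multiply and add rounds separately.
        float plain = (0.3f * r) + (0.59f * g) + (0.11f * b);
        // Device-style evaluation: fmaf rounds each multiply-add once.
        float fused = fmaf(0.11f, (float)b,
                           fmaf(0.59f, (float)g, 0.3f * (float)r));
        unsigned char p = (unsigned char)plain;
        unsigned char f = (unsigned char)fused;
        if (p != f) {
            printf("r=%d g=%d b=%d: plain=%d fused=%d\n", r, g, b, p, f);
            return 0;
        }
    }
    printf("no mismatch found\n");
    return 0;
}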
An excellent article discussing floating point representations and arithmetic can be found here.