TERCOM algorithm - Changing from single thread to multiple threads

Posted 2019-08-30 03:04

Question:

I'm currently working on porting a TERCOM algorithm from using only 1 thread to using multiple threads. Briefly explained, the TERCOM algorithm receives 5 measurements and the heading, and compares these measurements to a prestored map. The algorithm chooses the best match, i.e. the lowest Mean Absolute Difference (MAD), and returns the position.

The code works perfectly with one thread and for-loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems like the multithreaded version doesn't "run through" the calculation in the same way as the single-threaded version. Does anyone know what I am doing wrong?

Here's the code using for-loops

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min) 
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0;                  //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

This is my attempt to use multiple threads

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]); 
        }

        if(MAD<min) 
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0;                  //Reset MAD
    }
    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

To launch the kernel

dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;

kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);

Answer 1:

The basic problem here is that you have a memory race in the code, centred on the use of f as both a sort of thread-local scratch space and an output variable. Every concurrent thread will be trying to write values into the same locations in f simultaneously, which produces undefined behaviour.

As best I can tell, using f as scratch space isn't necessary at all, and the main computational section of the kernel could be written as something like:

if(idx < n && idy < m)
{
    for(float g=0; g<N; g++)
    {
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += abs(measurements[(int)g]-fval); 
    }
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}

[disclaimer: written in browser, use at own risk]

At the end of that calculation, each thread has its own values of min and pos. At a minimum these must be stored in unique global memory, i.e. the output must have enough space for each thread's result. You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. That could be done on the host, in device code, or some combination of the two. There is a lot of code already available for CUDA parallel reductions, which you should be able to find by searching and/or looking in the examples supplied with the CUDA toolkit. It should be trivial to adapt them to your specific case, where you need to retain the position along with the minimum value.
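To sketch what that might look like (again written in browser and untested; the buffer names mad_out and dev_mad below are only placeholders, and the output allocation must now hold m*n floats rather than 3): have each thread write its MAD value to one unique slot of a global array, then find the minimum on the host.

__global__ void kernel (int m, int n, int h, int N, float *mad_out, float heading, float *measurements)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    if(idx < n && idy < m)
    {
        float theta = heading*(PI/180);
        float offset_x = h*cos(theta);
        float offset_y = -h*sin(theta);

        float MAD = 0;
        for(int g=0; g<N; g++)
        {
            float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += fabsf(measurements[g]-fval);
        }
        mad_out[idy*n + idx] = MAD;   // one unique output slot per thread, no race
    }
}

On the host, a simple serial scan over the copied-back values is enough to demonstrate the idea (a device-side parallel reduction would be faster for large maps):

// Host side; needs <vector> and <cfloat>
std::vector<float> mad(m*n);
cudaMemcpy(mad.data(), dev_mad, m*n*sizeof(float), cudaMemcpyDeviceToHost);

float best = FLT_MAX;
int best_col = 0, best_row = 0;
for(int row=0; row<m; row++)
    for(int col=0; col<n; col++)
        if(mad[row*n + col] < best)
        {
            best = mad[row*n + col];   // track minimum MAD and its position
            best_col = col;
            best_row = row;
        }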