Controlling the index variables in C++ AMP

I have just started trying C++ AMP and decided to give it a shot in the project I am currently working on. At some point I have to build a distance matrix for a set of vectors, and I have written the code below for this:

unsigned int samplesize=samplelist.size();
unsigned int vs = samplelist.front().size();

vector<double> samplevec(samplesize*vs);
vector<double> distancevec(samplesize*samplesize,0);

it1=samplelist.begin();

for(int i=0 ; i<samplesize; ++i){
    for(int j = 0 ; j<vs ; ++j){
        samplevec[j + i*vs] = (*it1)[j];
    }
    ++it1;
}

array_view<const double,2> samplearray(samplesize,vs,samplevec);
array_view<writeonly<double>,2> distances(samplesize,samplesize,distancevec);

parallel_for_each(distances.grid, [=](index<2> idx) restrict(direct3d){
    double sqrsum=0;
    double tempd=0;

    for ( unsigned int i=0 ; i<vs ; ++i)
    {
        tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
        sqrsum += tempd*tempd;
    }
    distances[idx]=sqrsum;
});

However, as you can see, this does not take the symmetry of distance matrices into account. Once I have calculated sqrsum for vectors i and j, I don't want to repeat the same calculation with i and j swapped. Is there any way to accomplish this? I came up with the following trick, but I don't know whether it would improve performance significantly:

    for ( unsigned int i=0 ; i<vs ; ++i)
    {
        if(idx.x<=idx.y){
            break;
        }

        tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
        sqrsum += tempd*tempd;
    }

Can the if-condition do the job? Or would the if statement hurt performance unnecessarily? I couldn't come up with any alternative to it.

BTW, I just noticed that the code above does not work on my machine, whose GPU only supports single precision. Is there any way to get around that problem? The error message is as follows: "runtime_exception: Concurrency::parallel_for_each uses features unsupported by the selected accelerator. ID3D11Device::CreateComputeShader: Shader uses double precision float ops which are not supported on the current device."

I think you can eliminate the if-condition if you schedule only as many threads as you need, instead of scheduling the entire rectangle that covers your output matrix. What you need is the upper or lower triangle without the diagonal. Its size follows from an arithmetic series: for n samples it contains n(n-1)/2 entries.
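As a rough illustration of the triangle idea, the sketch below maps a flat thread index k to a pair (i, j) with j < i, so that a 1D launch over n(n-1)/2 threads covers the strict lower triangle exactly once. It is plain host-side C++ rather than AMP kernel code, and the names pair_count and pair_from_linear are hypothetical helpers introduced here, not part of C++ AMP. Inside a kernel the same arithmetic would have to be written inline or in a suitably restricted function, and on a single-precision device the square root would be computed in float, which makes the integer correction steps more important.

```cpp
#include <cmath>
#include <utility>

// Hypothetical helper: number of unordered pairs (i, j) with j < i < n,
// i.e. the size of the strict lower triangle of an n x n matrix.
// Note: n * (n - 1) can overflow unsigned int for very large n.
unsigned int pair_count(unsigned int n) {
    return n * (n - 1) / 2;
}

// Hypothetical helper: map a flat index k in [0, pair_count(n)) to (i, j), j < i.
// Row i starts at flat index i * (i - 1) / 2, so i is found by inverting
// that triangular-number formula and j is the offset within the row.
std::pair<unsigned int, unsigned int> pair_from_linear(unsigned int k) {
    unsigned int i = static_cast<unsigned int>((1.0 + std::sqrt(1.0 + 8.0 * k)) / 2.0);
    // Correct for possible floating-point rounding in the square root.
    while (i * (i - 1) / 2 > k) --i;
    while (i * (i + 1) / 2 <= k) ++i;
    unsigned int j = k - i * (i - 1) / 2;
    return std::make_pair(i, j);
}
```

For n = 4 this enumerates (1,0), (2,0), (2,1), (3,0), (3,1), (3,2) for k = 0..5. Each thread would then compute the distance once and could write it to both (i, j) and (j, i) of the output if the full matrix is needed.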

An alternative would be to organize the input data as two 1D vectors: each thread reads one value from vector 1 and one from vector 2, calculates the distance, and stores it in one of the input vectors.

Finally, the double-precision error shows up because the card you are using does not support double-precision operations. Please check your card's specification to confirm that. You can work around it by switching to a single-precision type, i.e. "float", in the array_view template.
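Switching the array_view element type to float also means the host buffer behind it has to hold floats. A minimal host-side sketch of that conversion follows. It assumes samplelist is a std::list<std::vector<double>>, which the question does not actually state, and the helper name flatten_as_float is hypothetical. The sqrsum and tempd variables in the kernel and the distancevec buffer would need to become float as well.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical helper: flatten the samples row by row into one float buffer.
// Assumes every sample has the same length as the first one.
std::vector<float> flatten_as_float(const std::list<std::vector<double>>& samples) {
    std::vector<float> flat;
    if (samples.empty()) return flat;

    const std::size_t vs = samples.front().size();
    flat.reserve(samples.size() * vs);

    for (const auto& row : samples) {
        for (std::size_t j = 0; j < vs; ++j) {
            // Narrowing to float loses precision beyond roughly 7 decimal digits.
            flat.push_back(static_cast<float>(row[j]));
        }
    }
    return flat;
}
```

The resulting vector can take the place of samplevec. Whether the reduced precision is acceptable depends on the range of your data.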