Let p
be a matrix of first set of locations where each row gives the coordinates of a particular point. Similarly, let q
be a matrix of second set of locations where each row gives the coordinates of a particular point.
Then formula for pairwise squared Euclidean distance is:
k(i,j) = (p(i,:) - q(j,:))*(p(i,:) - q(j,:))',
where p(i,:)
denotes i
-th row of matrix p
, and p'
denotes the transpose of p
.
I would like to compute matrix k
on CUDA-enabled GPU (NVidia Tesla) in C++. I have OpenCV v.2.4.1 with GPU support but I'm open to other alternatives, like Thrust library. However, I'm not too familiar with GPU programming. Can you suggest an efficient way to accomplish this task? What C++ libraries should I use?
The problem looks simple enough to make a library overkill.
Without knowing the range of i
and j
, I'd suggest you partition k
into blocks of a multiple of 32 threads each and in each block, compute
float sum, myp[d];
int i = blockIdx.x*blockDim.x + threadIdx.x;
for ( int kk = 0 ; kk < d ; kk++ )
myp[kk] = p(i,kk);
for ( j = blockIdx.y*blockDim.y ; j < (blockIdx.y+1)*blockDim ; j++ ) {
sum = 0.0f;
#pragma unroll
for ( int kk = 0 ; kk < d ; kk++ ) {
temp = myp[kk] - q(j,kk);
sum += temp*temp;
}
k(i,j) = sum;
}
where I am assuming that your data has d
dimensions and writing p(i,k)
, q(j,k)
and k(i,j)
to mean an access to a two-dimensional array. I also took the liberty in assuming that your data is of type float
.
Note that depending on how k
is stored, e.g. row-major or column-major, you may want to loop over i
per thread instead to get coalesced writes to k
.