Vectorization of min distance in kernel

I have an Nx2 array K1 with the location of N keypoints and a 3 dimensional WxHx3 array Kart1(width,height,coordinates) that maps coordinates to every pixel of an image. For every keypoint in K1 I want to read the location of the pixel in Kart1 and evaluate the coordinates (search for the min/max or calculate the mean) in a 3x3 kernel around it and assign a value to the current pixel in KPCoor1.

My current approach looks like this:

for ii=1:length(K1(:,1)) %for every keypoint in K1

    MinDist=sqrt(sum(Kart1(K1(ii,2)-1,K1(ii,1)-1,:).^2)); %Calculate distance
    xShift=0;
    yShift=0;
    for kk=-1:1 %for every pixel in a 3x3 kernel...
        for ll=-1:1

            Distance=sqrt(sum(Kart1(K1(ii,2)+kk,K1(ii,1)+ll,:).^2));

            if Distance<MinDist   %... if the current distance is smaller than MinDist
                MinDist=Distance; %... update MinDist...
                xShift=kk; %... and take the kernel coordinate of the pixel
                yShift=ll;
            end

        end
    end

    KP1Coor(ii,:)=Kart1(K1(ii,2)+xShift,K1(ii,1)+yShift,:); %assign the coordinates of the pixel with the minimal distance in kernel.

end

and it runs, but is ugly and I doubt it's doing what I want to do. I am a bit confused by the "multidimensionality" of the matter, don't know many functions to evaluate kernels, and can't think of a way to use vectorization functions like bsxfun() or logic operations (means I'm stuck and my brain is dry :/)

Any suggestions on how to eliminate those loops/correct the code?

Vectorized approach: After diving into the vectorized implementation, it looked like a very interesting look-up problem, so if you are still interested in vectorized techniques to get the job done, here's an approach with bsxfun -

%// Scalars to replace their repeated usages in the code
[W,H,NC]= size(Kart1);
d3a = 1:NC;

%// Indices for Kart1 at the first level of the nested loops
BI = bsxfun(@plus,K1(:,2)+(K1(:,1)-1)*W,(d3a-1)*W*H);

%// Indices for Kart1 in 3x3 kernel around the indices obtained at first level
BIW3 = bsxfun(@plus,[-1 0 1]',[-W 0 W]); %//'
%// Linear indices for the minimum value of Kart1 in the 3x3 neighborhood 
[~,MI3] = min(sqrt(sum(Kart1(bsxfun(@plus,...
    BI,permute(BIW3(:),[3 2 1]))).^2,2)),[],3);
%// X-Y indices
[xShift1,yShift1] = ind2sub([3 3],MI3);

%// Get Kart1 values corresponding to the indices for the minimum values
KP1Coor = Kart1(bsxfun(@plus,...
    K1(:,2)+xShift1-2 +(K1(:,1)+yShift1-2-1)*W,(d3a-1)*W*H));

Benchmarking

I was able to test it out with GPU too using gpuArray from Parallel Computing Toolbox and then run some benchmarks with W = 1000, H = 1000 and used N as the datasize varying it with [1000 2000 5000 10000 20000]. The results seem crazy enough, though not unreliable as approved benchmarking methods were used from Measure and Improve GPU Performance. Here is the benchmark plot for original and CPU and GPU vectorized codes -

It looked appropriate to then benchmark only the vectorized codes and for even larger datasizes, whose plot is shown next -

Conclusions: The problem basically looks like a look-up problem, where Kart1 is the data and K1 is the array of indices to look-up for. The vectorized solution presented here is basically a brute-force approach and the benchmark results clearly favor this for performance. But, it would be interesting to see if any non-brute-force approach that might be even loop-based, but leverages this looking up efficiently could perform better.