GPU optimization for vectorized code

Published 2019-02-19 15:28

Question:

function w=oja(X, varargin)

% get the dimensionality
[m n] = size(X);

% random initial weights
w = randn(m,1);

options = struct( ...
    'rate', .00005, ...
    'niter', 5000, ...
    'delta', .0001);
options = getopt(options, varargin);
success = 0;

% run through all input samples
for iter = 1:options.niter
    y = w'*X;
    for ii = 1:n
        % Oja's rule update: y(ii) is a scalar, X(:,ii) is one input sample
        w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
    end
end
if (any(~isfinite(w)))
    warning('Lost convergence; lower learning rate?');
end

end

size(X) = [400 153600]

This code implements Oja's rule and runs slowly. I am not able to vectorize it any further. To make it run faster I wanted to do the computations on the GPU, so I changed

X = gpuArray(X)

But the code ran even slower instead. The operations involved seem to be GPU-compatible, so please let me know what my mistake is.
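
In full, the GPU attempt looks roughly like this (a sketch; the gather call at the end is only needed to bring the result back to host memory):

X = gpuArray(X);   % copy the 400 x 153600 input matrix to GPU memory
w = oja(X);        % operations involving X (and hence y and w) now run on the GPU
w = gather(w);     % copy the final weight vector back to host memory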

Profile Code Output:

Complete details: https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing

Answer 1:

This is not a full answer on how to solve it, but rather an explanation of why the GPU does not speed up your code, but actually slows it down enormously.

GPUs are fantastic at speeding up parallel code, meaning code where they can do A LOT of things at the same time (e.g. my GPU can do 30070 things at the same time, while a modern CPU can't go over 16). However, individual GPU cores are very slow! A decent CPU nowadays runs at around 2-3 GHz, while a modern GPU runs at around 700 MHz. This means that a single CPU core is much faster than a single GPU core, but because GPUs can do so many things at once they can still win overall.

Once I saw it explained like this: what do you prefer, a million-dollar sports car or a scooter? A million-dollar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for that last one (unless you are a scooter fan and answered scooters to all of them, but that's not the point). (source, and a good introduction to GPUs)
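
To make that concrete, here is a small benchmark sketch (it assumes the Parallel Computing Toolbox is installed; sizes are made up and exact numbers depend on your hardware). One big element-wise operation has plenty of parallel work, so the GPU typically wins; one tiny operation is dominated by launch and transfer overhead, so the GPU typically loses:

A  = randn(4000);            % large matrix: millions of independent elements
a  = randn(400, 1);          % small vector: hardly any work per call
Ag = gpuArray(A);
ag = gpuArray(a);

tBigCPU   = timeit(@() A.^2 + A);        % big parallel job on the CPU
tBigGPU   = gputimeit(@() Ag.^2 + Ag);   % same job on the GPU: usually much faster
tSmallCPU = timeit(@() a*2 + a);         % tiny job on the CPU
tSmallGPU = gputimeit(@() ag*2 + ag);    % tiny job on the GPU: usually slower (overhead dominates)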

Back to your code: your code is extremely sequential. Every inner iteration depends on the previous one, and the same goes for the outer iterations. You cannot run two of them in parallel, because you need the result of one iteration to run the next one. This means you will not get a new pizza order until you have delivered the last one, so what you want is to deliver them one by one, as fast as you can (so the sports car is better!).
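
Unrolling two consecutive passes of the inner loop makes that dependency explicit (illustration only; w0, w1, w2 are hypothetical names for the weight vector at successive steps):

w1 = w0 + options.rate*(y(1)*X(:,1) - y(1)^2*w0);   % step 1 needs the current weights w0
w2 = w1 + options.rate*(y(2)*X(:,2) - y(2)^2*w1);   % step 2 needs w1, so it cannot start before step 1 finishes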

And actually, each evaluation of that one-line update is incredibly fast! If I run 50 outer iterations on my computer, that line takes 13.034 seconds in total, which is about 1.69 microseconds per update (7,680,000 calls).
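
If you want to reproduce that kind of measurement, a rough timing harness could look like this (a sketch using the sizes from the question; the exact numbers will of course differ on your machine):

X    = randn(400, 153600);   % same size as in the question
w    = randn(400, 1);
rate = .00005;
tic
for iter = 1:50                    % 50 outer iterations
    y = w'*X;
    for ii = 1:size(X, 2)          % 153600 inner updates per outer iteration
        w = w + rate*(y(ii)*X(:,ii) - y(ii)^2*w);
    end
end
t = toc;
fprintf('%.3f s total, %.2f microseconds per update (%d calls)\n', ...
        t, 1e6*t/(50*size(X,2)), 50*size(X,2));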

So your problem is not that your code is slow; it is that you call that line A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for this kind of work.

So, unfortunately, GPUs are bad at sequential code, and your code is very sequential, therefore you cannot use a GPU to speed it up. An HPC cluster will not help either, because every loop iteration depends on the previous one (no parfor :( ).
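
That is also why a plain parfor is ruled out here. In a sketch like the one below (reusing n, y, options and X from the question's function), w is written in one pass and read again in the next, which violates parfor's requirement that iterations be independent, so MATLAB's parfor analysis rejects the loop:

parfor ii = 1:n
    % not a valid reduction: the update of w also reads w on the right-hand side,
    % so MATLAB cannot classify w and refuses to run the loop
    w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end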

So, as far as I can tell, you will have to live with it.