Predicting runtime of parallel loop using a-priori

I'm working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).

After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.

One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).

Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.

However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.

My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, resp. the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.

My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.

So far, I had been working with the 12 workers available under a the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (since for many cores the overhead of the serial loop becomes more and more costly in comparison to the parallel loop, but similarly, having blocks which hold up the rest are even more costly).

For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut off, but this doesn't work well when the number of cores isn't small anymore in relation to the number of blocks (e.g 64 vs 200).

I've tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, resp. the number of batches they are away from the end:

logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
    i = i+1;
    m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order 
    logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;  
        % if the small blocks were parallelised perfectly, i.e. all  
        % cores take the same time, the time would be proportional to  
        % i*m. To try to discount the different sizes (and imperfect  
        % parallelisation), we only scale with a power of i less than  
        % one to not end up with a few blocks which hold up the rest  
end  
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);

(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)

I've also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.

I'd be very grateful if someone has a good idea for how to solve this.
Thanks for reading this very long question,
Best regards,
Axel

Ps. Since my "Dear stackoverflow," seems to be filtered, let me say thanks for the numerous times I've already found the solution to a question of mine here.

I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.

Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:

function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   c: Estimated cost of parallel computation

num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);

i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
    i=i+1;
    [~,i_min]=min(c); % which core finished first; is fed with next block
    c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end

c=max(c);

end

The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: The cost of an empty iteration in a for/parfor-loop (could also be different per block), as well as the start-up time resp. transmission of data of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.

I use the above routine to determine the cut-off in the following way:

% function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial an parallel regime

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   i: Number of blocks to be calculated serially

num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);

for i=0:num_blocks
    cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
    cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end

[~,i]=min(sum(cost,2));
i=i-1;

end

In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.

Note also that est_cost_para is called with successively less blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).

Compared to the approach in my wordy question I see two main advantages:

The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).

Of course, the asymptotic complexity is higher by calling est_cost_para with its while-loop all the time, but in my case (num_blocks<500) this is absolutely negligible.

Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then trying to fit the resulting data to the cost prediction and get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.