
Out-of-memory error in parfor: kill the slave, not the master

Published 2020-03-26 04:20

Question:

When an out-of-memory error is raised in a parfor, is there any way to kill only one Matlab slave to free some memory instead of having the entire script terminate?

Here is what happens by default when an out-of-memory error occurs in a parfor: the entire script terminates.

I wish there were a way to kill just one slave (i.e. remove a worker from the parpool), or to stop using it, so as to release as much memory as possible.

Answer 1:

If you get an out-of-memory error in the master process, there is no way to recover. For an out-of-memory error on a slave, the following should do it:

The simple idea of the code: restart the parfor again and again with the missing iterations until you have all the results. If one iteration fails, a flag (a file) is written, which makes all remaining iterations throw an error as soon as the first error has occurred. This way we get out of the loop quickly, without wasting time producing further out-of-memory errors.

%Your intended iterator
iterator=1:10;
%flags which indicate which iterations succeeded
succeeded=false(size(iterator));
%result array
result=nan(size(iterator));
FLAG='ANY_WORKER_CRASHED';
while ~all(succeeded)
    fprintf('Another try\n')
    %determine which iterations still need to be done
    todo=iterator(~succeeded);
    %initialize array for the remaining results
    partresult=nan(size(todo));
    %initialize flags which indicate which iterations succeeded (we cannot
    %let errors escape the parfor, that would throw away results)
    partsucceeded=false(size(todo));
    %the flag file indicates that some worker crashed. A file-based
    %solution is used; I don't know a better one.
    if exist(FLAG,'file')
        delete(FLAG);
    end
    try
        parfor falseindex=1:sum(~succeeded)
            realindex=todo(falseindex);
            try
                % The flag is used to let all other workers jump out of the
                % loop as soon as one calculation has crashed.
                if exist(FLAG,'file')
                    error('some other worker crashed');
                end
                % insert your code here
                %dummy code which randomly throws an exception
                if rand<.5
                    error('hit out of memory')
                end
                partresult(falseindex)=realindex*2;
                % End of user code
                partsucceeded(falseindex)=true;
                fprintf('trying to run %d and succeeded\n',realindex)
            catch ME
                % catch errors within workers to preserve the work done so far
                partresult(falseindex)=nan;
                partsucceeded(falseindex)=false;
                fprintf('trying to run %d but it failed\n',realindex)
                %create the flag file to signal the other workers
                fclose(fopen(FLAG,'w'));
            end
        end
    catch
        %a worker died; reduce the pool size by 1 and retry
        newsize = matlabpool('size')-1;
        matlabpool close
        matlabpool(newsize)
    end
    %put the results of the current iteration into the full result
    result(~succeeded)=partresult;
    succeeded(~succeeded)=partsucceeded;
end
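
Note that matlabpool was removed in later MATLAB releases in favor of parpool. On R2013b or newer, the pool-shrinking step in the catch block would look roughly like this (a sketch using the parpool API, not part of the original answer):

%On newer releases, shrink the pool with the parpool API instead:
pool = gcp('nocreate');            % handle to the current pool (empty if none)
if ~isempty(pool)
    newsize = pool.NumWorkers - 1;
    delete(pool);                  % close the current pool
    parpool(newsize);              % reopen with one worker fewer
end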


Answer 2:

After quite a bit of research, and a lot of trial and error, I think I may have a decent, compact answer. What you're going to do is:

  1. Declare some max memory value. You can set it dynamically using the MATLAB function memory, but I like to set it directly.
  2. Call memory inside your parfor loop, which returns the memory information for that particular worker.
  3. If the memory used by the worker exceeds the threshold, cancel the task that worker is working on. Now, here it gets a bit tricky: depending on the way you're using parfor, you'll need to either delete or cancel either the task or the worker. I've verified that the code below works when there is one task per worker, on a remote cluster.

Insert the following code at the beginning of your parfor contents. Tweak as necessary.

memLimit = 280000000;    %// in bytes. This doesn't have to be in the parfor. Everything else does.

memData = memory;        %// memory information for this particular worker
if memData.MemUsedMATLAB > memLimit
    task = getCurrentTask();
    cancel(task);        %// abort this worker's current task
end
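
For context, here is a minimal sketch of how that check might sit inside a parfor loop; the loop bounds and the doubling computation are placeholders, not part of the original answer. Note that the memory function is only available on Windows.

memLimit = 280000000;               % bytes; set once outside the parfor
N = 10;
result = zeros(1,N);
parfor i = 1:N
    memData = memory;               % memory information for this worker
    if memData.MemUsedMATLAB > memLimit
        task = getCurrentTask();    % handle to the task this worker is running
        cancel(task);               % abort it before it exhausts memory
    end
    result(i) = i*2;                % placeholder for your real computation
end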

Enjoy! (Fun question, by the way.)



Answer 3:

One other option to consider: since R2013b, you can open a parallel pool with 'SpmdEnabled' set to false. This allows MATLAB worker processes to die without the whole pool being shut down - see the doc here: http://www.mathworks.co.uk/help/distcomp/parpool.html . Of course, you still need to arrange somehow to shut down the workers.
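
A minimal sketch of opening such a pool (the pool size of 4 is arbitrary):

% Workers in this pool may die individually without taking the pool down
pool = parpool(4, 'SpmdEnabled', false);

% ... run your parfor work here ...

delete(pool);    % shut the pool down when finished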