Where is the memory leak? How to timeout threads d

2020-08-13 04:58发布

问题:

It is unclear how to properly timeout workers of joblib's Parallel in python. Others have had similar questions here, here, here and here.

In my example I am utilizing a pool of 50 joblib workers with threading backend.

Parallel Call (threading):

output = Parallel(n_jobs=50, backend  = 'threading')
    (delayed(get_output)(INPUT) 
        for INPUT in list)

Here, Parallel hangs without errors as soon as len(list) <= n_jobs but only when n_jobs => -1.

In order to circumvent this issue, people give instructions on how to create a timeout decorator to the Parallel function (get_output(INPUT)) in the above example) using multiprocessing:

Main function (decorated):

@with_timeout(10)    # multiprocessing
def get_output(INPUT):     # threading
    output = do_stuff(INPUT)
    return output

Multiprocessing Decorator:

def with_timeout(timeout):
    def decorator(decorated):
        @functools.wraps(decorated)
        def inner(*args, **kwargs):
            pool = multiprocessing.pool.ThreadPool(1)
            async_result = pool.apply_async(decorated, args, kwargs)
            try:
                return async_result.get(timeout)
            except multiprocessing.TimeoutError:
                return
        return inner
    return decorator

Adding the decorator to the otherwise working code results in a memory leak after ~2x the length of the timeout plus a crash of eclipse.

Where is this leak in the decorator?

How to timeout threads during multiprocessing in python?

回答1:

It is not possible to kill a Thread in Python without a hack.

The memory leak you are experiencing is due to the accumulation of threads you believe they have been killed. To prove that, just try to inspect the amount of threads your application is running, you will see them slowly growing.

Under the hood, the thread of the ThreadPool is not terminated but keeps running your function until the end.

The reason why a Thread cannot be killed, is due to the fact that threads share memory with the parent process. Therefore, it is very hard to kill a thread while ensuring the memory integrity of your application.

Java developers figured it out long ago.

If you can run your function in a separate process, then you could easily rely on a timeout logic where the process itself is killed once the timeout is reached.

The Pebble library already offers decorators with timeout.