python multiprocessing.Pool kill *specific* long running process

Posted 2019-02-09 17:47

I need to execute a pool of many parallel database connections and queries. I would like to use a multiprocessing.Pool or a concurrent.futures.ProcessPoolExecutor (Python 2.7.5).

In some cases, query requests take too long or never finish (hung/zombie process). I would like to kill the specific process in the multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor that has timed out.

Here is an example of how to kill/re-spawn the entire process pool, but ideally I would minimize that CPU thrashing, since I only want to kill a specific long-running process that has not returned data after timeout seconds.

For some reason, the code below does not seem to be able to terminate/join the process Pool after all results are returned and completed. It may have to do with killing worker processes when a timeout occurs; however, the Pool creates new workers when they are killed, and the results are as expected.

from multiprocessing import Pool
import time
import numpy as np

def f(x):
    time.sleep(x)
    return x

if __name__ == '__main__':
    pool = Pool(processes=4, maxtasksperchild=4)

    results = [(x, pool.apply_async(f, (x,))) for x in np.random.randint(10, size=10).tolist()]

    while results:
        try:
            x, result = results.pop(0)
            start = time.time()
            print result.get(timeout=5), '%d done in %f Seconds!' % (x, time.time()-start)

        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)
            for p in pool._pool:
                if p.exitcode is None:
                    p.terminate()

    pool.terminate()
    pool.join()

4 Answers
beautiful°
2019-02-09 18:01

To avoid accessing the pool's internal variables, you can save multiprocessing.current_process().pid from the executing task into shared memory. Then iterate over multiprocessing.active_children() from the main process and kill the target PID if it exists.
However, after such external termination the workers are recreated, but the pool becomes non-joinable and also requires explicit termination before the join().

再贱就再见
2019-02-09 18:04

In your solution you're tampering with the internal variables of the pool itself. The pool relies on three different threads to operate correctly; it is not safe to intervene in their internal variables without being really aware of what you're doing.

There's no clean way to stop timed-out processes in the standard Python Pools, but there are alternative implementations which expose such a feature.

You can take a look at the following libraries:

pebble

billiard

你好瞎i
2019-02-09 18:20

I also came across this problem.

The original code and the edited version by @stacksia have the same issue: in both cases, all currently running processes are killed when the timeout is reached for just one of them (i.e. when the loop over pool._pool runs).

Find my solution below. It involves creating a .pid file for each worker process, as suggested by @luart. It will work as long as there is a way to tag each worker process (in the code below, x does this job). If someone has a more elegant solution (such as saving the PID in memory), please share it.

#!/usr/bin/env python

from multiprocessing import Pool
import time, os, sys
import subprocess

def f(x):
    PID = os.getpid()
    print 'Started:', x, 'PID=', PID

    pidfile = "/tmp/PoolWorker_"+str(x)+".pid"

    if os.path.isfile(pidfile):
        print "%s already exists, exiting" % pidfile
        sys.exit()

    open(pidfile, 'w').write(str(PID))

    # Do the work here
    time.sleep(x*x)

    # Delete the PID file
    os.remove(pidfile)

    return x*x


if __name__ == '__main__':
    pool = Pool(processes=3, maxtasksperchild=4)

    results = [(x, pool.apply_async(f, (x,))) for x in [1,2,3,4,5,6]]

    pool.close()

    while results:
        print results
        try:
            x, result = results.pop(0)
            start = time.time()
            print result.get(timeout=3), '%d done in %f Seconds!' % (x, time.time()-start)

        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)

            # We know which process gave us an exception: it is "x", so let's kill it!

            # First, let's get the PID of that process:
            pidfile = '/tmp/PoolWorker_'+str(x)+'.pid'
            PID = None
            if os.path.isfile(pidfile):
                PID = str(open(pidfile).read())
                print x, 'pidfile=',pidfile, 'PID=', PID

            # Now, let's check if there is indeed such a process running:
            for p in pool._pool:
                print p, p.pid
                if str(p.pid)==PID:
                    print 'Found  it still running!', p, p.pid, p.is_alive(), p.exitcode

                    # We can also double-check how long it's been running with the system 'ps' command:
                    tt = str(subprocess.check_output('ps -p "'+str(p.pid)+'" o etimes=', shell=True)).strip()
                    print 'Run time from OS (may be way off the real time..) = ', tt

                    # Now, kill it:
                    p.terminate()
                    pool._pool.remove(p)
                    pool._repopulate_pool()

                    # Let's not forget to remove the pidfile
                    os.remove(pidfile)

                    break

    pool.terminate()
    pool.join()

Many people suggest pebble. It looks nice, but it is only available for Python 3. If someone has a way to get pebble imported for Python 2.6, that would be great.

乱世女痞
2019-02-09 18:21

I am not fully understanding your question. You say you want to stop one specific process, but then, in your exception-handling phase, you call terminate on all workers; I am not sure why you are doing that. Also, I am pretty sure that using internal variables of multiprocessing.Pool is not safe. Having said all of that, I think your real question is why this program does not finish when a timeout happens. If that is the problem, then the following does the trick:

from multiprocessing import Pool
import time
import numpy as np

def f(x):
    time.sleep(x)
    return x

if __name__ == '__main__':
    pool = Pool(processes=4, maxtasksperchild=4)

    results = [(x, pool.apply_async(f, (x,))) for x in np.random.randint(10, size=10).tolist()]

    result = None
    start = time.time()
    while results:
        try:
            x, result = results.pop(0)
            print result.get(timeout=5), '%d done in %f Seconds!' % (x, time.time()-start)
        except Exception as e:
            print str(e)
            print '%d Timeout Exception! in %f' % (x, time.time()-start)
            for i in reversed(range(len(pool._pool))):
                p = pool._pool[i]
                if p.exitcode is None:
                    p.terminate()
                del pool._pool[i]

    pool.terminate()
    pool.join()

The point is that you need to remove items from the pool; just calling terminate on them is not enough.
