multiprocessing.Pool() slower than just using ordi

(This question is about how to make multiprocessing.Pool() run code faster. I finally solved it, and the final solution can be found at the bottom of the post.)

Original Question:

I'm trying to use Python to compare a word with many other words in a list and retrieve a list of the most similar ones. To do that I am using the difflib.get_close_matches function. I'm on a relatively new and powerful Windows 7 laptop, with Python 2.6.5.

What I want is to speed up the comparison process, because my comparison list of words is very long and I have to repeat the comparison several times. When I heard about the multiprocessing module, it seemed logical that if the comparison could be broken up into worker tasks and run simultaneously (trading more machine power for speed), my comparison task would finish faster.

However, even after trying many different approaches, including methods shown in the docs and suggested in forum posts, the Pool method just seems to be incredibly slow, much slower than running the original get_close_matches function on the entire list at once. I would like help understanding why Pool() is so slow and whether I am using it correctly. I'm only using this string comparison scenario as an example because it is the most recent case where I was unable to get multiprocessing to work for me rather than against me. Below is some example code from the difflib scenario showing the time difference between the ordinary and the pooled methods:

from multiprocessing import Pool
import random, time, difflib

# constants
wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(1000000)]
mainword = "hello"

# comparison function
def findclosematch(subwordlist):
    matches = difflib.get_close_matches(mainword,subwordlist,len(subwordlist),0.7)
    if matches <> []:
        return matches

# pool
print "pool method"
if __name__ == '__main__':
    pool = Pool(processes=3)
    t=time.time()
    result = pool.map_async(findclosematch, wordlist, chunksize=100)
    #do something with result
    for r in result.get():
        pass
    print time.time()-t

# normal
print "normal method"
t=time.time()
# run function
result = findclosematch(wordlist)
# do something with results
for r in result:
    pass
print time.time()-t

The word to be found is "hello", and the list of words in which to find close matches is a list of 1 million strings of 5 randomly chosen characters (only for illustration purposes). I use 3 processor cores and the map function with a chunksize of 100 (list items to be processed per worker, I think?). I also tried chunksizes of 1000 and 10 000, but there was no real difference. Notice that in both methods I start the timer right before calling my function and stop it right after having looped through the results. As you can see below, the timing results are clearly in favor of the original non-Pool method:

>>> 
pool method
37.1690001488 seconds
normal method
10.5329999924 seconds
>>>

The Pool method is about 3.5 times slower than the original method. Is there something I am missing here, or am I misunderstanding how pooling/multiprocessing works? I suspect that part of the problem could be that my function returns None whenever there is no match, so map adds thousands of unnecessary items to the results list even though I only want actual matches to be returned. From what I understand, that is just how map works. I have heard about other functions like filter that only collect non-False results, but I don't think multiprocessing/Pool supports a filter method. Are there any other functions besides map/imap in the multiprocessing module that could help me return only the actual matches? The apply function is more for passing multiple arguments, as I understand it.

I know there's also the imap function, which I tried, but without any improvement in time. I suspect the reason is the same one that has made it hard for me to understand what's so great about the itertools module. It is supposedly "lightning fast", which I've noticed is true for calling the function, but in my experience and from what I've read, that's because calling the function doesn't actually do any calculations. When it's time to iterate through the results to collect and analyze them (without which there would be no point in calling the function), it takes just as much time as, or sometimes more than, simply using the normal version of the function. But I suppose that's for another post.

Anyway, I'm excited to see if someone can nudge me in the right direction here, and I really appreciate any help. I'm more interested in understanding multiprocessing in general than in getting this example to work, though some example solution code would help my understanding.

The Answer:

Seems like the slowdown had to do with the slow startup time of additional processes. I couldn't get the .Pool() function to be fast enough. My final solution to make it faster was to manually split the workload list, use multiple .Process() instead of .Pool(), and return the solutions in a Queue. But I wonder if the most crucial change might have been splitting the workload in terms of the main word to look for rather than the words to compare with, perhaps because the difflib search function is already so fast. Here is the new code running 5 processes at the same time, which turned out about 10x faster than running the simple version (6 seconds vs 55 seconds). Very useful for fast fuzzy lookups, on top of how fast difflib already is.

from multiprocessing import Process, Queue
import difflib, random, time

def f2(wordlist, mainwordlist, q):
    for mainword in mainwordlist:
        matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
        q.put(matches)

if __name__ == '__main__':

    # constants (for 50 input words, find closest match in list of 100 000 comparison words)
    q = Queue()
    wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(100000)]
    mainword = "hello"
    mainwordlist = [mainword for each in xrange(50)]

    # normal approach
    t = time.time()
    for mainword in mainwordlist:
        matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
        q.put(matches)
    print time.time()-t

    # split work into 5 or 10 processes
    processes = 5
    def splitlist(inlist, chunksize):
        return [inlist[x:x+chunksize] for x in xrange(0, len(inlist), chunksize)]
    print len(mainwordlist)/processes
    mainwordlistsplitted = splitlist(mainwordlist, len(mainwordlist)/processes)
    print "list ready"

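    t = time.time()
    procs = []
    for submainwordlist in mainwordlistsplitted:
        print "sub"
        p = Process(target=f2, args=(wordlist,submainwordlist,q,))
        p.daemon = True  # the attribute is lowercase; "p.Daemon" has no effect
        p.start()
        procs.append(p)
    # join every worker, not just the last one that was started
    # (results here are small; with large results, drain the queue before joining)
    for p in procs:
        p.join()
    print time.time()-t
    # the queue holds one result per main word from each of the two runs above
    for i in xrange(2 * len(mainwordlist)):
        print q.get()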

Answer 1:

My best guess is inter-process communication (IPC) overhead. In the single-process instance, the single process has the word list. When delegating to various other processes, the main process needs to constantly shuttle sections of the list to other processes.

Thus, it follows that a better approach might be to spin off n processes, each of which is responsible for loading/generating 1/n segment of the list and checking if the word is in that part of the list.

I'm not sure how to do that with Python's multiprocessing library, though.
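
For what it's worth, one way this idea could be sketched with multiprocessing (Python 3 syntax, unlike the Python 2 code in the question) is to split the list into n contiguous segments and hand each worker one whole segment, so get_close_matches runs once per segment rather than once per word. The helper names split_evenly and match_segment are made up for illustration. Note that the segments are still pickled and sent to the workers here, so this reduces the number of tasks, not the amount of data moved.

```python
import difflib
import random
from multiprocessing import Pool

MAINWORD = "hello"  # word to look up, as in the question


def split_evenly(items, n):
    """Split items into n contiguous segments whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    segments, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        segments.append(items[start:end])
        start = end
    return segments


def match_segment(segment):
    """Run one fuzzy search over a whole segment (one call per worker task)."""
    if not segment:
        return []  # get_close_matches requires n > 0
    return difflib.get_close_matches(MAINWORD, segment, n=len(segment), cutoff=0.7)


if __name__ == "__main__":
    random.seed(0)
    wordlist = ["".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(5))
                for _ in range(20000)]
    n_workers = 3
    with Pool(processes=n_workers) as pool:
        per_segment = pool.map(match_segment, split_evenly(wordlist, n_workers))
    # Flatten the per-segment lists into one list of matches.
    matches = [word for found in per_segment for word in found]
    print(len(matches), "close matches")
```

Having each worker generate or load its own segment, as suggested above, would avoid the transfer entirely, but that depends on where the real word list comes from.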

Answer 2:

These problems usually boil down to the following:

The function you are trying to parallelize doesn't require enough CPU resources (i.e. CPU time) to justify parallelization!

Sure, when you parallelize with multiprocessing.Pool(8), you theoretically (but not practically) could get an 8x speedup.

However, keep in mind that this isn't free - you gain this parallelization at the expense of the following overhead:

1. Creating a task for every chunk (of size chunksize) in the iter passed to Pool.map(f, iter).
2. For each task:
   a. Serializing the task and the task's return value (think pickle.dumps()).
   b. Deserializing the task and the task's return value (think pickle.loads()).
   c. Wasting significant time waiting for Locks on shared-memory Queues while worker processes and the parent process get() and put() from/to these Queues.
3. A one-time cost of calls to os.fork() for each worker process, which is expensive. (On Windows, where the question was run, there is no fork(); each worker is a freshly started interpreter that re-imports the main module, which is typically more expensive still.)
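
To make the per-task costs above a bit more concrete, here is a rough model (Python 3) of how a map call is cut into tasks. It is a simplification written for illustration, not the actual Pool implementation, and make_tasks, run_task and round_trip are hypothetical names. It shows that the number of tasks depends on chunksize, that every task and every result goes through a pickle round trip, and that the function is still called once per item, not once per chunk.

```python
import pickle


def make_tasks(items, chunksize):
    """Cut items into chunks of at most chunksize; each chunk becomes one task."""
    return [tuple(items[i:i + chunksize]) for i in range(0, len(items), chunksize)]


def run_task(func, chunk):
    """What a worker does with one task: call func once per item in the chunk."""
    return [func(item) for item in chunk]


def round_trip(obj):
    """Serialize and deserialize obj, as happens when it crosses a process boundary."""
    return pickle.loads(pickle.dumps(obj))


words = ["hello", "hallo", "world", "help"]
tasks = make_tasks(words, chunksize=3)
print(len(tasks), "tasks")

for chunk in tasks:
    received = round_trip(chunk)       # parent -> worker
    result = run_task(len, received)   # func sees one word at a time
    print(round_trip(result))          # worker -> parent
```

Under this model, the question's 1,000,000 words with chunksize=100 become 10,000 tasks, and findclosematch is handed a single word on each call rather than a sublist, so difflib ends up comparing "hello" against the individual characters of that word.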

In essence, when using Pool() you want:

High CPU resource requirements
Low data footprint passed to each function call
Reasonably long iter to justify the one-time cost of (3) above.
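
A minimal sketch of the "low data footprint" point, in Python 3 and under the assumption that many main words are looked up against one fixed comparison list: the list is handed to each worker once through Pool's initializer argument, and each task then carries only a short string. init_worker, closest and the module-level _wordlist are illustrative names, not part of the original code.

```python
import difflib
from multiprocessing import Pool

_wordlist = None  # set once in every worker by init_worker


def init_worker(words):
    """Runs once per worker process; stores the big list in a module global."""
    global _wordlist
    _wordlist = words


def closest(mainword):
    """Per-task work: only mainword is sent in, only the matches are sent back."""
    return mainword, difflib.get_close_matches(mainword, _wordlist, n=3, cutoff=0.7)


if __name__ == "__main__":
    words = ["hello", "hallo", "help", "world", "word", "ward"]
    with Pool(processes=2, initializer=init_worker, initargs=(words,)) as pool:
        for mainword, matches in pool.map(closest, ["hello", "word"]):
            print(mainword, matches)
```

This is close in spirit to the OP's final solution, which also splits the work by main word instead of by comparison list.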

For a more in-depth exploration, this post and linked talk walk through how large data being passed to Pool.map() (and friends) gets you into trouble.

Raymond Hettinger also talks about proper use of Python's concurrency here.

Answer 3:

I experienced something similar with the Pool on a different problem. I'm not sure of the actual cause at this point...

The Answer edit by OP Karim Bahgat is the same solution that worked for me. After switching to a Process & Queue system, I was able to see speedups in line with the number of cores on the machine.

Here's an example.

import multiprocessing as mp

def do_something(data):
    return data * 2

def consumer(inQ, outQ):
    while True:
        try:
            # get a new message
            val = inQ.get()

            # this is the 'TERM' signal
            if val is None:
                break

            # unpack the message
            pos = val[0]  # it's helpful to pass in/out the pos in the array
            data = val[1]

            # process the data
            ret = do_something(data)

            # send the response / results
            outQ.put( (pos, ret) )


        except Exception, e:
            print "error!", e
            break

def process_data(data_list, inQ, outQ):
    # send pos/data to workers
    for i,dat in enumerate(data_list):
        inQ.put( (i,dat) )

    # process results
    for i in range(len(data_list)):
        ret = outQ.get()
        pos = ret[0]
        dat = ret[1]
        data_list[pos] = dat


def main():
    # initialize things
    n_workers = 4
    inQ = mp.Queue()
    outQ = mp.Queue()
    # instantiate workers
    workers = [mp.Process(target=consumer, args=(inQ,outQ))
               for i in range(n_workers)]

    # start the workers
    for w in workers:
        w.start()

    # gather some data
    data_list = [ d for d in range(1000)]

    # let's process the data a few times
    for i in range(4):
        process_data(data_list, inQ, outQ)

    # tell all workers, no more data (one msg for each)
    for i in range(n_workers):
        inQ.put(None)
    # join on the workers
    for w in workers:
        w.join()

    # print out final results  (i*16)
    for i,dat in enumerate(data_list):
        print i, dat

if __name__ == '__main__':
    main()