Python multiprocessing - Pool.map running only one

Posted 2019-08-05 13:33

Question:

I have code that parses a fairly large number of XML files (using the xml.sax library) to extract data for later machine learning. I want the parsing to run in parallel (the server has 24 cores but also runs some web services, so I decided to use 20 of them). After parsing, I want to merge the results. The following code should do (and does) exactly what I expect, but there is a problem with the parallel part.

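from multiprocessing import Pool
from xml.sax import make_parser

# MyXMLHandler is the custom SAX content handler (definition not shown).

def runParse(fname):
    # Parse one file and return the data collected by the handler.
    parser = make_parser()
    handler = MyXMLHandler()
    parser.setContentHandler(handler)
    parser.parse(fname)
    return handler.getResult()

def makeData(flist, tasks=20):
    pool = Pool(processes=tasks)
    tmp = pool.map(runParse, flist)
    for result in tmp:
        pass  # and here the merging part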

When this part starts, it runs on 20 cores for a while and then drops to only one, and this happens before the merging part (which will of course run on only one core).

Can anyone help to solve this problem or suggest a way to speed up the program?

Thanks!

ppiikkaaa

Answer 1:

Why do you say it goes to only one before completing?

You're using .map(), which collects all the results and only then returns. So for a large dataset you are probably stuck in the collecting phase.

You can try .imap(), which is the iterator version of .map(), or even .imap_unordered() if the order of the results is not important (as seems to be the case in your example).

Here's the relevant documentation. Worth noting the line:

For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
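As a rough sketch of that suggestion, the version below merges results as workers hand them back instead of waiting for the full list. The names parse_one, merge_results and make_data are hypothetical, parse_one is only a stand-in for your runParse (it does no XML parsing), and the pool size and chunksize values are placeholders to tune, not recommendations.

```python
from multiprocessing import Pool


def parse_one(fname):
    # Hypothetical stand-in for runParse: no XML parsing happens here,
    # it just returns the file name and its length as a fake "result".
    return fname, len(fname)


def merge_results(results):
    # Consume (name, value) pairs one at a time, in whatever order they
    # arrive, so merging overlaps with the parsing still in progress.
    merged = {}
    for name, value in results:
        merged[name] = value
    return merged


def make_data(flist, tasks=4, chunksize=8):
    # imap_unordered yields each result as soon as a worker finishes it;
    # chunksize batches several files per task to cut dispatch overhead.
    with Pool(processes=tasks) as pool:
        return merge_results(
            pool.imap_unordered(parse_one, flist, chunksize=chunksize)
        )


if __name__ == "__main__":
    files = ["file_%d.xml" % i for i in range(100)]
    print(len(make_data(files)))
```

If the order of the results does matter, pool.imap takes the same arguments and yields them in input order.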