I have code that parses a large number of XML files (using the xml.sax library) to extract data for later machine learning. I want the parsing to run in parallel (the server has 24 cores but also runs some web services, so I decided to use 20 of them). After parsing, I want to merge the results. The following code does exactly what I expect, but there is a problem with the parallel part.
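from multiprocessing import Pool
from xml.sax import make_parser

def runParse(fname):
    # Parse one file and return the data collected by the handler
    parser = make_parser()
    handler = MyXMLHandler()
    parser.setContentHandler(handler)
    parser.parse(fname)
    return handler.getResult()

def makeData(flist, tasks=20):
    pool = Pool(processes=tasks)
    # map() blocks until every file has been parsed and returns a full list
    tmp = pool.map(runParse, flist)
    for result in tmp:
        # and here the merging part
        pass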
When this part starts, it runs on 20 cores for a while and then drops to only one, and this happens before the merging part (which will of course run on only one core).
Can anyone help to solve this problem or suggest a way to speed up the program?
Thanks!
ppiikkaaa
Why do you say it goes to only one before completing?
You're using .map(), which collects the results and then returns. So for a large dataset you are probably stuck in the collecting phase. You can try using .imap(), which is the iterator version of .map().
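As a rough sketch of what switching to .imap() could look like, here is a small self-contained example that merges each result as soon as it arrives instead of waiting for the whole list. The TagCounter handler, the count_tags worker and the merge_incrementally function are hypothetical stand-ins for MyXMLHandler, runParse and makeData, and it parses in-memory strings rather than files so that it can be run as is. Whether this removes the single-core phase depends on where the time actually goes, so treat it as something to try rather than a guaranteed fix.

```python
from collections import Counter
from multiprocessing import Pool
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical stand-in for MyXMLHandler: counts element names."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_tags(xml_text):
    # Runs in a worker process; the returned Counter is pickled
    # and sent back to the parent process.
    handler = TagCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


def merge_incrementally(docs, processes=2):
    merged = Counter()
    with Pool(processes=processes) as pool:
        # imap() yields results lazily, in input order, so each one
        # can be merged while the workers keep parsing the rest.
        for counts in pool.imap(count_tags, docs):
            merged.update(counts)
    return merged


if __name__ == "__main__":
    docs = ["<a><b/><b/></a>", "<a><c/></a>"]
    print(merge_incrementally(docs))
```

With many small inputs, the chunksize argument of imap() may also be worth experimenting with, since the default of 1 sends items to the workers one at a time.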
If the order of analysis is not important (as it seems from your example), you can even use .imap_unordered(). Here's the relevant documentation. Worth noting the line: