I have code that parses a fairly large number of XML files (using the xml.sax library) to extract data for future machine learning. I want the parsing part to run in parallel (the server has 24 cores but also runs some web services, so I decided to use 20 of them). After the parsing I want to merge the results. The following code does exactly what I expect, but there is a problem with the parallel part.
    from multiprocessing import Pool
    from xml.sax import make_parser

    def runParse(fname):
        parser = make_parser()
        handler = MyXMLHandler()   # my SAX ContentHandler subclass
        parser.setContentHandler(handler)
        parser.parse(fname)
        return handler.getResult()

    def makeData(flist, tasks=20):
        pool = Pool(processes=tasks)
        tmp = pool.map(runParse, flist)
        for result in tmp:
            # and here the merging part
When this part starts, it runs on 20 cores for a while and then drops to only one core, and this happens before the merging part (which will of course run on only one core).
Can anyone help solve this problem or suggest a way to speed up the program?
Thanks!
ppiikkaaa