I'm performing natural language processing using NLTK on some fairly large datasets and would like to take advantage of all my processor cores. It seems the multiprocessing module is what I'm after; when I run the following test code I see all cores being utilized, but the code never completes.
Executing the same task, without multiprocessing, finishes in approximately one minute.
Python 2.7.11 on Debian.
from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp
def open_file(filepath):
    # open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    # word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens
filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
DEPRECATED
This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead
Here's a cheater's way to do multi-threading using sframe:
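A minimal sketch of that approach, assuming the standalone sframe package is installed and the corpus is split into one document per line (the file path is taken from the question; the column name and the line-per-document split are assumptions):

from nltk.tokenize import word_tokenize
import io
import sframe

# Read the corpus and treat each line as one document (assumption).
with io.open('./p40_compiled.txt', encoding='utf-8') as fin:
    lines = fin.read().splitlines()

# SFrame's apply() fans the tokenizer out over its worker processes,
# which is what gives the cheap parallelism here.
sf = sframe.SFrame({'text': lines})
sf['tokens'] = sf['text'].apply(word_tokenize)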
Note that the speed difference might be because I have something else running on the other cores. But given a much larger dataset and dedicated cores, you can really see this scale.
It has been a couple of years and SFrame has since moved on to become part of turicreate:
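The import is the only thing that really changes; a sketch (the PyPI package name is turicreate, and the example sentences are placeholders):

# pip install -U turicreate
from nltk.tokenize import word_tokenize
import turicreate as tc

# Same SFrame API as before, now living under the turicreate namespace.
sf = tc.SFrame({'text': ['this is a foo bar sentence', 'and another one']})
sf['tokens'] = sf['text'].apply(word_tokenize)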
And the speed-up is sort of significant from using the new SFrame (in Python 3).

In native Python and NLTK:
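A sketch of the single-process baseline being timed here (the corpus path and the line-per-document split are carried over from the question as assumptions):

from nltk.tokenize import word_tokenize
import io
import time

with io.open('./p40_compiled.txt', encoding='utf-8') as fin:
    lines = fin.read().splitlines()

# Tokenize every document sequentially on one core.
start = time.time()
tokens = [word_tokenize(line) for line in lines]
print('Native NLTK:', time.time() - start, 'seconds for', len(tokens), 'documents')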
With SFrame
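And the SFrame side of the comparison, again only a sketch under the same assumptions:

from nltk.tokenize import word_tokenize
import io
import time
import turicreate as tc

with io.open('./p40_compiled.txt', encoding='utf-8') as fin:
    lines = fin.read().splitlines()

start = time.time()
sf = tc.SFrame({'text': lines})
sf['tokens'] = sf['text'].apply(word_tokenize)
sf.materialize()  # force the lazily evaluated apply() to run before stopping the clock
print('SFrame:', time.time() - start, 'seconds for', sf.num_rows(), 'documents')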
Note: SFrame is lazily evaluated; .materialize() forces the persistence of the SFrame to disk, committing all lazily evaluated operations.

With Joblib
Additionally, you can use joblib for "embarrassingly simple" parallelization:
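A joblib sketch of the same job (the n_jobs value and the line-per-document split are assumptions):

from joblib import Parallel, delayed
from nltk.tokenize import word_tokenize
import io

with io.open('./p40_compiled.txt', encoding='utf-8') as fin:
    lines = fin.read().splitlines()

# Each line is tokenized independently, so the work is embarrassingly parallel.
tokens = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)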