I have a very long list of tweets (2 million), and I use regexes to search and replace text in them. I run this with a joblib.Parallel map (joblib is the parallel backend used by scikit-learn).
My problem is that I can see in Windows' Task Manager that my script does not use 100% of each CPU. It doesn't use 100% of the RAM or the disk either, so I don't understand why it won't go faster.
There are probably synchronization delays somewhere, but I can't find what or where.
The code:
# file main.py
import re
from joblib import delayed, Parallel

def make_tweets():
    tweets = load_from_file()  # this is a list of strings
    regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
    mydict = {'aa': 'A', 'bb': 'B'}

    def handler(match):
        return mydict[match[0].replace(' ', '')]

    def replace_in(tweet):
        return re.sub(regex, handler, tweet)

    # -1 means all cores
    # I have 6 cores that can run 12 threads
    with Parallel(n_jobs=-1) as parallel:
        tweets2 = parallel(delayed(replace_in)(tweet) for tweet in tweets)
    return tweets2
And here's the Task Manager:
Edit: final word
The answer is that the worker processes were slowed down by joblib's synchronization: joblib sends the tweets to the workers in small chunks (one by one?), which makes them wait. Using multiprocessing.Pool.map with a chunksize of len(tweets) / cpu_count() made the workers utilize 100% of the CPU.
Using joblib, the running time was around 12 min. Using multiprocessing it is 4 min. With multiprocessing, each worker process consumed around 50 MB of memory.
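For reference, a minimal sketch of what that multiprocessing version could look like, keeping the question's regex, mydict and the placeholder load_from_file; with multiprocessing the worker function has to live at module level so the spawned processes on Windows can pickle and import it:

# file main.py -- sketch of the multiprocessing variant described above
import re
from multiprocessing import Pool, cpu_count

regex = re.compile(r'a *a|b *b')
mydict = {'aa': 'A', 'bb': 'B'}

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

def make_tweets():
    tweets = load_from_file()  # placeholder from the question: a list of strings
    # one big chunk per worker: the children spend their time substituting,
    # not waiting for the parent to hand them the next tweet
    chunksize = max(1, len(tweets) // cpu_count())
    with Pool() as pool:
        return pool.map(replace_in, tweets, chunksize=chunksize)

if __name__ == '__main__':
    tweets2 = make_tweets()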
After a bit of playing, I think it's because joblib is spending all its time coordinating the parallel running of everything and no time actually doing any useful work. At least that's the case for me under OSX and Linux; I don't have any MS Windows machines.
I started by loading packages, pulling in your code, and generating a dummy file:
(see if you can guess where I got the lines from!)
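Something along these lines — the strings used to fill the dummy file below are stand-ins (as are the file name and line count), not whatever the original lines were:

import re
from random import choice

# your code, pulled in at module level
regex = re.compile(r'a *a|b *b')
mydict = {'aa': 'A', 'bb': 'B'}

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

# write ~2 million dummy "tweets" built from a few stock lines
stock = [
    'a a b b nothing up my sleeve',
    'plain text with no matches at all',
    'aa and b  b scattered through a longer sentence',
]
with open('tweets.txt', 'w') as fd:
    for _ in range(2_000_000):
        print(choice(stock), file=fd)

# read them back into a list of strings
with open('tweets.txt') as fd:
    tweets = [line.rstrip('\n') for line in fd]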
As a baseline, I tried the naive serial solution:
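That is, a plain loop over the list, reusing replace_in from above:

# naive serial version: no parallelism at all
tweets2 = [replace_in(tweet) for tweet in tweets]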
This takes 14.0s (wall clock time) on my OSX laptop, and 5.15s on my Linux desktop. Note that changing your definition of replace_in to call regex.sub(handler, tweet) instead of re.sub(regex, handler, tweet) reduces the above to 8.6s on my laptop, but I'll not use this change below.
I then tried your joblib package.
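A sketch of that run — essentially the question's Parallel call applied to the dummy list, reusing replace_in and tweets from above:

from joblib import delayed, Parallel

# one delayed call per tweet, n_jobs=-1 meaning all cores
with Parallel(n_jobs=-1) as parallel:
    tweets2 = parallel(delayed(replace_in)(tweet) for tweet in tweets)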
This takes 1min 16s on my laptop, and 34.2s on my desktop. CPU utilisation was pretty low, as the child/worker tasks were all waiting for the coordinator to send them work most of the time.
I then tried using the multiprocessing package.
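Again a sketch, reusing replace_in and tweets from above; the chunk size used for the faster timing isn't shown here, so Pool.map's default batching is assumed (on platforms that spawn rather than fork, this needs to sit under an if __name__ == '__main__' guard):

from multiprocessing import Pool

with Pool() as pool:
    # Pool.map batches the iterable into sizeable chunks by default;
    # the chunk-size-8 run would be pool.map(replace_in, tweets, chunksize=8)
    tweets2 = pool.map(replace_in, tweets)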
This took 5.95s on my laptop and 2.60s on my desktop. I also tried with a chunk size of 8, which took 22.1s and 8.29s respectively. The chunk size allows the pool to send large chunks of work to its children, so it can spend less time coordinating and more time getting useful work done.
I'd therefore hazard a guess that joblib isn't particularly useful for this sort of usage, as it doesn't seem to have a notion of chunksize.