Multiprocessing Pool slow when calling external module

Posted 2019-07-13 01:36

Question:

My script is calling librosa module to compute Mel-frequency cepstral coefficients (MFCCs) for short pieces of audio. After loading the audio, I'd like to compute these (along with some other audio features) as fast as possible - hence multiprocessing.

Problem: the multiprocessing variant is much slower than the sequential one. Profiling says my code spends over 90% of the time on <method 'acquire' of '_thread.lock' objects>. That would not be surprising with many small tasks, but in one test case I divide my audio into 4 chunks and process them in separate processes. I was thinking the overhead should be minimal, but in practice it's almost as bad as with many small tasks.

To my understanding, the multiprocessing module should fork the interpreter, so the workers should not be fighting over a shared lock. However, the results seem to show something different. Could it be that the librosa module keeps some sort of internal lock underneath?

My profiling results in plain text: https://drive.google.com/open?id=17DHfmwtVOJOZVnwIueeoWClUaWkvhTPc

As an image: https://drive.google.com/open?id=1KuZyo0CurHd9GjXge5CYQhdWn2Q6OG8Z
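
For reference, a profile like this can be collected with the standard library's cProfile. A minimal sketch (mfcc_profile.out is an arbitrary file name; it assumes the names from the script below are in scope):

import cProfile
import pstats

# Profile one pool.map run and print the entries with the largest
# cumulative time. In the slow case described above, the report is
# dominated by <method 'acquire' of '_thread.lock' objects>, because
# the parent process mostly waits on worker results.
profiler = cProfile.Profile()
with Pool(n_proc) as pool:
    profiler.runcall(pool.map, partial(librosa.feature.mfcc, sr=sr), y_windowed)
profiler.dump_stats('mfcc_profile.out')
pstats.Stats('mfcc_profile.out').sort_stats('cumulative').print_stats(10)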

The code to reproduce the "problem":

import time
import numpy as np
import librosa
from functools import partial
from multiprocessing import Pool

n_proc = 4

y, sr = librosa.load(librosa.util.example_audio_file(), duration=60) # load audio sample
y = np.repeat(y, 10) # repeat signal so that we can get more reliable measurements
sample_len = int(sr * 0.2) # We will compute MFCC for short pieces of audio

def get_mfcc_in_loop(audio, sr, sample_len):
    # We split the long array into small ones of length sample_len
    y_windowed = np.array_split(audio, np.arange(sample_len, len(audio), sample_len))
    for sample in y_windowed:
        mfcc = librosa.feature.mfcc(y=sample, sr=sr)

start = time.time()
get_mfcc_in_loop(y, sr, sample_len)
print('Time single process:', time.time() - start)

# Now let's feed these small arrays to a pool of 4 workers. Since computing
# MFCCs for such small arrays is fast, I'd expect the per-task overhead to
# make this variant not that fast.
start = time.time()
y_windowed = np.array_split(y, np.arange(sample_len, len(y), sample_len))
with Pool(n_proc) as pool:
    func = partial(librosa.feature.mfcc, sr=sr)
    result = pool.map(func, y_windowed)
print('Time multiprocessing (many small tasks):', time.time() - start)

# Here we split the audio into 4 chunks and process each in a separate worker.
# This I'd expect to be fast, yet somehow it isn't. What could be the cause?
# Is there anything to be done about it?
start = time.time()
y_split = np.array_split(y, n_proc)
with Pool(n_proc) as pool:
    func = partial(get_mfcc_in_loop, sr=sr, sample_len=sample_len)
    result = pool.map(func, y_split)
print('Time multiprocessing (a few large tasks):', time.time() - start)

Results on my machine:

  • Time single process: 8.48s
  • Time multiprocessing (many small tasks): 44.20s
  • Time multiprocessing (a few large tasks): 41.99s

Any ideas what's causing it? Better yet, how to make it better?

Answer 1:

To investigate what was happening, I ran top -H and noticed 60+ threads being spawned! That was it. It turns out librosa and its dependencies spawn many extra threads that together ruin the parallelism.
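
If you want to confirm this from inside Python: threading.active_count() only sees Python-level threads and misses the native BLAS/OpenMP/numba workers, but on Linux the kernel's per-process task list counts them all. A minimal sketch, assuming /proc is available:

import os

# Count all OS-level threads of the current process (Linux only); each
# thread shows up as a directory under /proc/self/task.
print('OS threads:', len(os.listdir('/proc/self/task')))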

Solution

The problem of oversubscription is well described in the joblib docs. Let's use joblib, then.

import time
import numpy as np
import librosa
from joblib import Parallel, delayed

n_proc = 4

y, sr = librosa.load(librosa.util.example_audio_file(), duration=60) # load audio sample
y = np.repeat(y, 10) # repeat signal so that we can get more reliable measurements
sample_len = int(sr * 0.2) # We will compute MFCC for short pieces of audio

def get_mfcc_in_loop(audio, sr, sample_len):
    # We split the long array into small ones of length sample_len
    y_windowed = np.array_split(audio, np.arange(sample_len, len(audio), sample_len))
    for sample in y_windowed:
        mfcc = librosa.feature.mfcc(y=sample, sr=sr)

start = time.time()
y_windowed = np.array_split(y, np.arange(sample_len, len(y), sample_len))
Parallel(n_jobs=n_proc, backend='multiprocessing')(
    delayed(get_mfcc_in_loop)(audio=data, sr=sr, sample_len=sample_len)
    for data in y_windowed)
print('Time multiprocessing with joblib (many small tasks):', time.time() - start)


y_split = np.array_split(y, n_proc)
start = time.time()
Parallel(n_jobs=n_proc, backend='multiprocessing')(
    delayed(get_mfcc_in_loop)(audio=data, sr=sr, sample_len=sample_len)
    for data in y_split)
print('Time multiprocessing with joblib (a few large tasks):', time.time() - start)

Results:

  • Time multiprocessing with joblib (many small tasks): 2.66s
  • Time multiprocessing with joblib (a few large tasks): 2.65s

15x faster than using the multiprocessing module.
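
If you would rather keep using the multiprocessing module directly, the oversubscription can usually also be avoided by capping the native thread pools yourself. A sketch, assuming the extra threads come from the OpenMP/BLAS/numexpr/numba pools; these environment variables are the ones those libraries document, and they must be set before the libraries are imported:

import os

# Keep each worker process single-threaded so the n_proc processes
# don't fight over cores. Must run before importing numpy/librosa.
for var in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
            'NUMEXPR_NUM_THREADS', 'NUMBA_NUM_THREADS'):
    os.environ[var] = '1'

import numpy as np
import librosa
from multiprocessing import Pool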