My script calls the librosa module to compute Mel-frequency cepstral coefficients (MFCCs) for short pieces of audio. After loading the audio, I'd like to compute these (along with some other audio features) as fast as possible, hence the multiprocessing.
Problem: the multiprocessing variant is much slower than the sequential one. Profiling says my code spends over 90% of its time on <method 'acquire' of '_thread.lock' objects>. That wouldn't surprise me with many small tasks, but in one test case I divide my audio into just 4 chunks and process them in separate processes. I was thinking the overhead should be minimal there, but in practice it's almost as bad as with many small tasks.
To my understanding, the multiprocessing module forks almost everything, so there should not be any fighting over a lock. However, the results seem to show something different. Could it be that the librosa module keeps some sort of internal lock underneath?
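One way to probe this would be to rule out thread-pool contention first. A minimal sketch, assuming (this is only a guess, not confirmed for librosa) that the contention comes from BLAS/OpenMP thread pools inside NumPy and librosa's dependencies rather than from a Python-level lock: cap those pools to one thread per process before the numeric imports and re-run the timings.

```python
import os

# Hypothetical diagnostic: cap the thread pools that NumPy's BLAS backend
# (and libraries built on top of it) may create. These environment
# variables must be set BEFORE numpy/librosa are first imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the caps are in place

# If each of the 4 worker processes was previously spawning its own
# full-size thread pool, capping them to one thread per process avoids
# oversubscribing the CPU cores and the lock contention that comes with it.
print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

If the multiprocessing timings improve with these caps in place, the lock time was spent inside the numeric libraries' own thread pools, not in multiprocessing itself.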
My profiling results in plain text: https://drive.google.com/open?id=17DHfmwtVOJOZVnwIueeoWClUaWkvhTPc
As an image: https://drive.google.com/open?id=1KuZyo0CurHd9GjXge5CYQhdWn2Q6OG8Z
The code to reproduce the "problem":
import time
import numpy as np
import librosa
from functools import partial
from multiprocessing import Pool

n_proc = 4

y, sr = librosa.load(librosa.util.example_audio_file(), duration=60)  # load audio sample
y = np.repeat(y, 10)  # repeat the signal so that we get more reliable measurements
sample_len = int(sr * 0.2)  # we will compute MFCCs for short pieces of audio


def get_mfcc_in_loop(audio, sr, sample_len):
    # Split the long array into small ones of length sample_len
    y_windowed = np.array_split(audio, np.arange(sample_len, len(audio), sample_len))
    for sample in y_windowed:
        mfcc = librosa.feature.mfcc(y=sample, sr=sr)


start = time.time()
get_mfcc_in_loop(y, sr, sample_len)
print('Time single process:', time.time() - start)

# Now feed the same small arrays to a pool of 4 workers. Since computing an
# MFCC for each small array is fast, I'd expect the per-task overhead to
# dominate, so this one being slow would not surprise me.
start = time.time()
y_windowed = np.array_split(y, np.arange(sample_len, len(y), sample_len))
with Pool(n_proc) as pool:
    func = partial(librosa.feature.mfcc, sr=sr)
    result = pool.map(func, y_windowed)
print('Time multiprocessing (many small tasks):', time.time() - start)

# Here we split the audio into just 4 chunks and process each one in its own
# process. This I'd expect to be fast, and somehow it isn't. What could be
# the cause? Anything to do about it?
start = time.time()
y_split = np.array_split(y, n_proc)
with Pool(n_proc) as pool:
    func = partial(get_mfcc_in_loop, sr=sr, sample_len=sample_len)
    result = pool.map(func, y_split)
print('Time multiprocessing (a few large tasks):', time.time() - start)
Results on my machine:
- Time single process: 8.48s
- Time multiprocessing (many small tasks): 44.20s
- Time multiprocessing (a few large tasks): 41.99s
Any ideas what's causing this? Better yet, how can I make it faster?