No benefit from Python Multi-threading in IO task?

Posted 2019-07-17 06:58

Question:

I am trying to read several thousand hours of wav files in Python and get their durations. This essentially requires opening the wav file, getting the number of frames and factoring in the sampling rate. Below is the code for that:

import wave

def wav_duration(file_name):
    # Duration = number of frames / sampling rate; only the header needs to be read.
    wv = wave.open(file_name, 'r')
    nframes = wv.getnframes()
    samp_rate = wv.getframerate()
    duration = nframes / samp_rate
    wv.close()
    return duration


def build_datum(wav_file):
    # The label key is the last three path components, minus the ".wav" extension.
    key = "/".join(wav_file.split('/')[-3:])[:-4]
    try:
        datum = {"wav_file": wav_file,
                 "labels":   all_labels[key],   # all_labels: pre-built dict mapping keys to labels
                 "duration": wav_duration(wav_file)}
        return datum
    except KeyError:
        return "key_error"
    except Exception:
        return "wav_error"

Doing this sequentially will take too long. My understanding was that multi-threading should help here since it is essentially an IO task. Hence, I do just that:

import concurrent.futures
import time

all_wav_files = all_wav_files[:1000000]
data, key_errors, wav_errors = list(), list(), list()

start = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # submit jobs and get the mapping from futures to wav_file
    future2wav = {executor.submit(build_datum, wav_file): wav_file for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        try:
            datum = future.result()
            if datum == "key_error":
                key_errors.append(wav_file)
            elif datum == "wav_error":
                wav_errors.append(wav_file)
            else:
                data.append(datum)
        except Exception:
            print("Generated exception from thread processing: {}".format(wav_file))

print("Time : {}".format(time.time() - start))

However, to my dismay, I get the following results (in seconds):

Num threads | 100k wavs | 1M wavs
1           | 4.5       | 39.5
2           | 6.8       | 54.77
10          | 9.5       | 64.14
100         | 9.07      | 68.55

Is this expected? Is this a CPU-intensive task? Will multiprocessing help? How can I speed things up? I am reading files from the local drive, and this is running in a Jupyter notebook on Python 3.5.

EDIT: I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. Others' analyses have shown that in IO-bound cases it might be counterproductive to use multiprocessing, which is why I decided to use multi-threading instead.

I guess the question now is: Is this task IO bound?

EDIT EDIT: For those wondering, the task turned out to be CPU bound (one core was maxed out at 100%). The lesson here is not to make assumptions about the task; check it for yourself.

Answer 1:

Some things to check by category:

Code

  • How efficient is wave.open? Is it loading the entire file into memory when it could simply be reading the header information?
  • Why is max_workers set to 1?
  • Have you tried using cProfile or even timeit to get an idea of which part of the code is taking the most time? (See the profiling sketch below.)
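For a quick look, here is a minimal profiling sketch. It assumes build_datum and all_wav_files are defined as in the question and profiles a single-threaded pass over a sample of files:

import cProfile
import pstats

# Profile a single-threaded pass over a sample of the files to see where time goes.
profiler = cProfile.Profile()
profiler.enable()
for wav_file in all_wav_files[:10000]:
    build_datum(wav_file)
profiler.disable()

# Show the ten functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)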

Hardware

Re-run your existing setup while monitoring hard disk activity, memory usage and CPU load to confirm that hardware is not your limiting factor. If you see your hard disk running at maximum IO, your memory filling up or all CPU cores at 100%, one of those could be the bottleneck.
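One rough way to check this from within the notebook is a sketch like the following; it assumes the third-party psutil package is installed, and the surrounding setup is only illustrative:

import psutil

# Prime the per-core CPU counters so the next call reports usage since this point.
psutil.cpu_percent(percpu=True)
io_before = psutil.disk_io_counters()

# ... run the ThreadPoolExecutor loop from the question here ...

io_after = psutil.disk_io_counters()
print("Per-core CPU %:", psutil.cpu_percent(percpu=True))
print("MB read from disk:", (io_after.read_bytes - io_before.read_bytes) / 1e6)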

Global Interpreter Lock (GIL)

If there are no obvious hardware limitations, you are most likely running into problems with Python's Global Interpreter Lock (GIL), as described well in this answer. This behavior is to be expected if your code is effectively limited to a single core or the running threads provide no real concurrency. In this case, I'd most certainly switch to multiprocessing, starting with one process per CPU core, run that, and then compare the hardware monitoring results with the previous run.
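A minimal sketch of that switch, reusing build_datum and all_wav_files from the question; the result handling stays the same as in the threaded version:

import concurrent.futures
import os

# Separate processes sidestep the GIL; one worker per CPU core is a sensible start.
with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
    future2wav = {executor.submit(build_datum, wav_file): wav_file
                  for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        datum = future.result()  # sort into data / key_errors / wav_errors as before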