I am trying to read several thousand hours of wav files in Python and get their durations. This essentially requires opening each wav file, getting the number of frames, and dividing by the sampling rate. Below is the code for that:
import wave

def wav_duration(file_name):
    # Duration in seconds = number of frames / frames per second.
    wv = wave.open(file_name, 'r')
    nframes = wv.getnframes()
    samp_rate = wv.getframerate()
    duration = nframes / samp_rate
    wv.close()
    return duration

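def build_datum(wav_file):
    # Key: last three path components, minus the ".wav" extension.
    key = "/".join(wav_file.split('/')[-3:])[:-4]
    try:
        datum = {"wav_file": wav_file,
                 "labels": all_labels[key],
                 "duration": wav_duration(wav_file)}
        return datum
    except KeyError:
        return "key_error"
    except Exception:
        # Any other failure is treated as an unreadable wav file.
        return "wav_error"
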
Doing this sequentially will take too long. My understanding was that multi-threading should help here since it is essentially an IO task. Hence, I do just that:
import concurrent.futures
import time

all_wav_files = all_wav_files[:1000000]
data, key_errors, wav_errors = list(), list(), list()
start = time.time()
# max_workers is the value varied in the timings (1, 2, 10, 100)
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # submit jobs and get the mapping from futures to wav_file
    future2wav = {executor.submit(build_datum, wav_file): wav_file for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        try:
            datum = future.result()
            if datum == "key_error":
                key_errors.append(wav_file)
            elif datum == "wav_error":
                wav_errors.append(wav_file)
            else:
                data.append(datum)
        except Exception:
            print("Generated exception from thread processing: {}".format(wav_file))
print("Time : {}".format(time.time() - start))

To my dismay, however, I get the following results (in seconds):
Num threads | 100k wavs | 1M wavs
1 | 4.5 | 39.5
2 | 6.8 | 54.77
10 | 9.5 | 64.14
100 | 9.07 | 68.55
Is this expected? Is this a CPU-intensive task? Will multi-processing help? How can I speed things up? I am reading the files from a local drive, and this is running in a Jupyter notebook with Python 3.5.
EDIT: I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. Other people's analyses have shown that in IO-bound cases it can be counterproductive to use multi-processing. Hence I decided to use multi-threading instead.
I guess the question now is: Is this task IO bound?
EDIT EDIT: For those wondering, I think it was CPU bound (one core was maxed out at 100%). The lesson here is to not make assumptions about the task and to check it for yourself.
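One way to do that check, as a minimal sketch: run the work sequentially on a sample and compare CPU time with wall-clock time. The helper name `cpu_fraction` is hypothetical and not part of the code above, and the thresholds in its docstring are rough rules of thumb rather than hard limits.

```python
import time


def cpu_fraction(func, items):
    """Run func over items sequentially and return CPU time / wall time.

    A ratio close to 1.0 suggests the work is CPU bound; a ratio well
    below 1.0 suggests the process spends most of its time waiting
    (for example on disk IO).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time used by this process
    for item in items:
        func(item)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0
```

With the functions above, this could be called as `cpu_fraction(wav_duration, all_wav_files[:10000])`. Note that files already in the OS page cache will look more CPU bound than files read cold from disk, so the sample should resemble the real workload.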