I feel like this is a fairly common problem but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to break on words, which can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in python that does this automatically?
An easier way to do this is using the pydub module. Its recent addition of silence utilities does all the heavy lifting, such as setting up the silence threshold and setting up the silence length, etc., and simplifies the code significantly compared to the other methods mentioned.

Here is a demo implementation, with inspiration from here.
Setup:
I had an audio file with spoken English letters from A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the files were split into 26 separate files, with each audio file storing one syllable.

Observations: Some of the syllables were cut off, possibly needing modification of the following parameters:
min_silence_len=500
silence_thresh=-16

One may want to tune these to one's own requirements.
Demo Code:
Output:
You could look at Audiolab. It provides a decent API to convert the voice samples into numpy arrays. The Audiolab module uses the libsndfile C library to do the heavy lifting.
You can then scan the arrays for low values to find the pauses.
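As a sketch of that scanning step, assuming the samples are already in a 1-D float numpy array (e.g. as returned by Audiolab's wavread), a hypothetical find_pauses helper could look like:

```python
import numpy as np

def find_pauses(samples, rate, threshold=0.02, min_pause=0.3):
    """Return (start, end) times in seconds of stretches where the
    absolute amplitude stays below `threshold` for at least `min_pause`
    seconds. `samples` is a 1-D float array, `rate` the sample rate."""
    quiet = np.abs(samples) < threshold
    pauses = []
    start = None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                      # a quiet run begins
        elif not q and start is not None:
            if (i - start) / rate >= min_pause:
                pauses.append((start / rate, i / rate))
            start = None                   # the quiet run ended
    # handle a quiet run that extends to the end of the file
    if start is not None and (len(quiet) - start) / rate >= min_pause:
        pauses.append((start / rate, len(quiet) / rate))
    return pauses
```

The threshold and minimum pause length are illustrative; real speech will need tuning, much like the pydub parameters above.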
Use IBM STT. With timestamps=true you will get the word breakup along with when the system detects each word to have been spoken.

There are a lot of other cool features, like word_alternatives_threshold to get other possible words and word_confidence to get the confidence with which the system predicts the word. Set word_alternatives_threshold to between 0.1 and 0.01 to get a real idea.

This needs a sign-on, after which you can use the username and password generated.
The IBM STT is already a part of the SpeechRecognition module mentioned, but to get the word timestamps you will need to modify the function.
An extracted and modified form looks like:
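The modified function itself is not reproduced here; as a hedged sketch of the underlying idea, the Watson HTTP endpoint can be called directly with timestamps=true, and the per-word times pulled out of the JSON it returns. The URL and credentials below are placeholders, to be replaced with the ones generated at sign-on:

```python
import json
import requests

# Placeholder endpoint and credentials; substitute the values from sign-on.
URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
USERNAME, PASSWORD = "your-username", "your-password"

def recognize_with_timestamps(wav_path):
    """POST a WAV file to Watson STT, asking for per-word timestamps,
    confidences, and alternatives."""
    with open(wav_path, "rb") as f:
        response = requests.post(
            URL,
            params={"timestamps": "true",
                    "word_confidence": "true",
                    "word_alternatives_threshold": "0.05"},
            headers={"Content-Type": "audio/wav"},
            data=f,
            auth=(USERNAME, PASSWORD),
        )
    return json.loads(response.text)

def word_timestamps(result_json):
    """Flatten a Watson STT response into (word, start_s, end_s) triples."""
    words = []
    for result in result_json.get("results", []):
        for alt in result.get("alternatives", []):
            for word, start, end in alt.get("timestamps", []):
                words.append((word, start, end))
    return words
```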
pyAudioAnalysis can segment an audio file if the words are clearly separated (this is rarely the case in natural speech). The package is relatively easy to use:
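The answer's code snippet is not preserved, so here is a hedged sketch of that approach. It assumes pyAudioAnalysis' silence_removal function (older releases spelled it silenceRemoval), which returns [start_s, end_s] pairs in seconds; the small helper converts those into sample indices:

```python
def segments_to_samples(segments, rate):
    """Convert [start_s, end_s] second pairs, as returned by
    pyAudioAnalysis' silence_removal, into sample-index ranges."""
    return [(int(round(s * rate)), int(round(e * rate))) for s, e in segments]

# Usage (assumes pyAudioAnalysis is installed and "speech.wav" exists;
# the file name and tuning values here are illustrative):
#   from pyAudioAnalysis import audioBasicIO
#   from pyAudioAnalysis import audioSegmentation as aS
#   rate, samples = audioBasicIO.read_audio_file("speech.wav")
#   segments = aS.silence_removal(samples, rate, 0.020, 0.020,
#                                 smooth_window=1.0, weight=0.3, plot=False)
#   for start, end in segments_to_samples(segments, rate):
#       word = samples[start:end]   # one clearly separated word
```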
More details on my blog.