I am exploring the capabilities of SpeechRecognitionEngine. My end goal is to feed in a WAV file plus a transcription of that file, and get back the position in the audio where each word begins (and, ideally, ends).
I can get the engine to recognize the phrase successfully, but I cannot work out how to retrieve the audio position at which each word starts — as opposed to the time at which the recognition was hypothesized or completed.
If you're curious what the point of this is: I'm automating lipsync animation workflows.
Thanks for your time.
Proper audio-to-text alignment is a task that requires specific algorithms, distinct from plain speech recognition. You can emulate some alignment functionality with an ASR engine, but it will not work well.
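That said, if you want to try the emulation route with System.Speech, one approach is to constrain the engine with a grammar built from your known transcript and then ask the recognition result for the audio slice attributed to each word via `RecognitionResult.GetAudioForWordRange`. A minimal sketch (the file path and transcript are placeholders; accuracy of the per-word boundaries is not guaranteed):

```csharp
using System;
using System.Speech.Recognition;

class WordTimings
{
    static void Main()
    {
        // Placeholder transcript -- substitute the text of your WAV file.
        var transcript = "hello world";

        using (var engine = new SpeechRecognitionEngine())
        {
            // Constrain the engine to the known phrase so it only has to
            // align the words, not guess them.
            engine.LoadGrammar(new Grammar(new GrammarBuilder(transcript)));
            engine.SetInputToWaveFile("speech.wav"); // placeholder path

            RecognitionResult result = engine.Recognize();
            if (result == null)
            {
                Console.WriteLine("No recognition result.");
                return;
            }

            foreach (RecognizedWordUnit word in result.Words)
            {
                // GetAudioForWordRange returns the audio segment the engine
                // attributed to this word; AudioPosition is the offset from
                // the start of the input stream.
                RecognizedAudio audio = result.GetAudioForWordRange(word, word);
                Console.WriteLine("{0}: start={1}, end={2}",
                    word.Text,
                    audio.AudioPosition,
                    audio.AudioPosition + audio.Duration);
            }
        }
    }
}
```

Expect the boundaries to be rough, since the engine's word segmentation is a by-product of recognition rather than a purpose-built alignment.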
For proper implementations of alignment algorithms, check the CMUSphinx speech recognition toolkit:
http://cmusphinx.sourceforge.net/?s=long+audio+alignment
http://www.bluevincent.com/2011/02/speech-to-text-using-java.html
Or you can try a commercial service, like the one from Nexiwave:
http://nexiwave.com/index.php/applications/transcription-timestamping