I have a system where a user is asked to repeat a sentence after a prompt. It uses HTK to force-align the user's spoken sentence to a predefined word-level label file (of the sentence) to obtain a time-aligned phone-level file. The HMMs have been trained on a large amount of data and give very accurate time-aligned files with HVite.
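For reference, my alignment step looks roughly like this (the config, model, dictionary, and list file names are placeholders for my actual files):

    # -a: align each utterance against its transcription in words.mlf
    # -m: also output the model (phone) level boundaries
    HVite -a -m -b sil -C config -H macros -H hmmdefs \
          -i aligned.mlf -I words.mlf -y lab -S test.scp dict monophones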
My problem arises when the user does not speak the exact sentence that is required. Let me illustrate with an example:
Word-level label file of the target sentence that needs to be spoken (known to the user):

THIS IS A VERY GOOD DAY.

User says (Case 1): THIS IS A VERY GOOD DAY.

In this case, the user has repeated the exact sentence. The time-aligned file is very accurate and all is well.

User says (Case 2): THIS IS A GOOD DAY.
In this case, the forced alignment is carried out with the same word-level label file as given above. The resulting time-aligned file contains time instants for words that were never spoken by the user (such as VERY, which exists in the original sentence but not in what was actually said).
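The aligned output then looks something like this (the times, in HTK's 100 ns units, are made up purely for illustration; the point is that VERY is assigned a segment even though it was never uttered):

    0        2500000  THIS
    2500000  4200000  IS
    4200000  4900000  A
    4900000  5200000  VERY
    5200000  8800000  GOOD
    8800000 12100000  DAY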
Is there a way within HTK to detect and possibly avoid this?
One solution would be some sort of front-end pre-processor that performs unconstrained speech recognition (itself a very hard problem, since it would effectively need an unrestricted vocabulary) and lets the user know that what they have spoken is incorrect.
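A cheaper idea I have been toying with, in case it helps frame the question: build a word network with HParse in which the words that might be dropped are optional, run HVite in recognition mode (rather than alignment mode) over that network, and score the recognized word string against the target with HResults. A rough sketch, where the grammar and all file names are again placeholders:

    $ cat gram                         # square brackets mark VERY as optional
    ( THIS IS A [VERY] GOOD DAY )

    $ HParse gram wdnet                # compile the grammar into a word network

    $ HVite -C config -H macros -H hmmdefs -S test.scp \
            -i recout.mlf -w wdnet dict monophones

    $ HResults -I ref.mlf monophones recout.mlf   # flags deletions such as VERY

I do not know whether this scales to arbitrary deviations from the target sentence, which is part of why I am asking.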
Are there any tools/command-line options within HTK that allow me to do this?
P.S.: Please let me know in case more details are needed.
Thanks,
Sriram