Microsoft SAPI System.Speech for transcription

Posted 2019-08-13 09:25

I'm currently researching tools that can transcribe audio files. The first thing I'm looking at is the possibility of using Microsoft's System.Speech API.

Looking through the MSDN documentation, it seems this API is better suited to short voice commands, where you have some knowledge of what to expect from the speaker. It requires you to create a Grammar to get good accuracy.

Can someone who has experience with this API confirm whether this is right?

Tags: .net sapi
2 answers
Juvenile、少年°
Reply #2 · 2019-08-13 10:17

To expand on Lesley's answer -

Microsoft has 3 different SR engines available, with different tradeoffs.

  • System.Speech.Recognition (or Desktop SAPI) - supports single-person dictation and input from a wave file (or other stream), but the recognizer has to be trained for a particular person in order to get good recognition. In addition, the input source must be of high quality (low noise, 16 bit, 22 kHz sample rate).

  • Microsoft.Speech.Recognition (or Server SAPI) - doesn't support dictation at all, but does take input from a wave file (or other stream), does not need training, and works with lower quality input sources (more noise, 8 bit, 8 kHz sample rate).

  • Windows.Media.SpeechRecognition - the new Windows Runtime speech recognition API. Supports dictation, does not need training, and works with lower quality input sources, but doesn't take input from a wave file, and requires that your app be based on the Windows Runtime.
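As a rough illustration of the input-quality point, here is a minimal Python sketch that reads a WAV header and reports which of the two file-capable engines the format would suit. The thresholds are simply the numbers quoted in the list above, and the helper names (describe_wav, suitable_engines, make_silent_wav) are made up for this example. It checks format only; it says nothing about noise, training, or dictation support.

```python
import io
import wave

# Thresholds copied from the list above; treat them as rough guidance only.
DESKTOP_MIN_RATE_HZ = 22_000   # System.Speech.Recognition: 16 bit, 22 kHz
DESKTOP_MIN_BITS = 16
SERVER_MIN_RATE_HZ = 8_000     # Microsoft.Speech.Recognition: 8 bit, 8 kHz
SERVER_MIN_BITS = 8


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV.

    `source` may be a path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def suitable_engines(sample_rate_hz, bits_per_sample):
    """List the engines whose quoted minimum format this audio meets."""
    engines = []
    if sample_rate_hz >= DESKTOP_MIN_RATE_HZ and bits_per_sample >= DESKTOP_MIN_BITS:
        engines.append("System.Speech.Recognition")
    if sample_rate_hz >= SERVER_MIN_RATE_HZ and bits_per_sample >= SERVER_MIN_BITS:
        engines.append("Microsoft.Speech.Recognition")
    return engines


def make_silent_wav(sample_rate_hz, bits_per_sample, seconds=1):
    """Build a mono in-memory WAV of zero bytes, used here only as demo input."""
    bytes_per_sample = bits_per_sample // 8
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (sample_rate_hz * seconds * bytes_per_sample))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    rate, bits, channels = describe_wav(make_silent_wav(8_000, 8))
    print(rate, bits, channels, suitable_engines(rate, bits))
```

In practice you would pass the path of the recording you want to transcribe to describe_wav instead of the generated buffer.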

For a transcription scenario, I'd investigate the Windows.Media.SpeechRecognition tools, and look at something like Virtual Audio Cable to create a fake default audio input device that you can play the audio files into.

狗以群分
Reply #3 · 2019-08-13 10:22

Yes and no.

While theoretically any speech recognizer could implement SAPI (and therefore theoretically have ANY degree of accuracy), I've found that the stock Windows recognizer is very good for command and control, but not so good for free-form dictation or things like keyword spotting.

That's not to say you couldn't recognize a robust selection of words and have it be very accurate. I've had SAPI recognize and speak Klingon, and have used very large grammar files. It's just that when you attempt to create your own recognizer, or even your own SAPI voice, there is an absolute dearth of information. Typically the people who could help you are unlikely to, precisely BECAUSE it is so difficult or because the information they have is proprietary.

If you have a larger lexicon that you'd like to have recognized in a free-form fashion, you'd probably be better served by something like Sphinx.
