I'm currently researching tools that can transcribe audio files. The first thing I looked at is the possibility of using Microsoft's System.Speech API.
Looking through the MSDN documentation, it seems this API is better suited to short voice commands, where you have some knowledge of what to expect from the speaker. It requires you to create a Grammar to get good accuracy.
Can someone who has experience with this API confirm whether this is right?
Yes and no.
While theoretically any speech recognizer could implement SAPI (and therefore theoretically have ANY degree of accuracy), I've found the stock Windows recognizer to be very good for command and control, but not so good for free-form dictation or things like keyword spotting.
That's not to say you couldn't recognize a robust selection of words and have it be very accurate. I've had SAPI recognize and speak Klingon, and have used massive grammar files. It's just that when you attempt to create your own recognizer, or even your own SAPI voice, there is an absolute dearth of information. Typically the people who could help you are unlikely to, precisely BECAUSE it is so difficult or because the information they have is proprietary.
If you have a larger lexicon that you'd like to have recognized in a free-form fashion, you'd probably be better served by something like Sphinx.
To expand on Lesley's answer -
Microsoft has 3 different SR engines available, with different tradeoffs.
System.Speech.Recognition (or Desktop SAPI) - supports single-person
dictation and input from a wave file (or other stream), but the
recognizer has to be trained for a particular person in order to get
good recognition. In addition, the input source must be of high
quality (low noise, 16-bit, 22 kHz sample rate).
Microsoft.Speech.Recognition (or Server SAPI) - doesn't support
dictation at all, but does take input from a wave file (or other
stream), does not need training, and works with lower quality input
sources (more noise, 8-bit, 8 kHz sample rate).
Windows.Media.SpeechRecognition - the new Windows Runtime speech
recognition API. Supports dictation, does not need training, works
with lower quality input sources, but doesn't take input from a wave
file, and requires that your app be based on the Windows Runtime.
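
Since two of these engines accept wave files but differ in the input quality they expect, it can help to inspect a recording before deciding which one to try. Below is a rough Python sketch using only the standard-library `wave` module. The threshold constants simply restate the figures above, and the function names (`describe_wav`, `candidate_engines`, `make_test_wav`) are made up for illustration; none of this is part of any Microsoft API.

```python
import io
import wave

# Thresholds restate the figures quoted above; treat them as rough
# guidance rather than hard limits enforced by the engines.
DESKTOP_MIN_BITS = 16
DESKTOP_MIN_RATE_HZ = 22_000
SERVER_MIN_BITS = 8
SERVER_MIN_RATE_HZ = 8_000


def describe_wav(source):
    """Return (bits_per_sample, sample_rate_hz, channels) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getsampwidth() * 8, wav.getframerate(), wav.getnchannels()


def candidate_engines(bits, rate_hz):
    """List the file-capable engines whose stated input quality this audio meets."""
    engines = []
    if bits >= DESKTOP_MIN_BITS and rate_hz >= DESKTOP_MIN_RATE_HZ:
        engines.append("System.Speech.Recognition")
    if bits >= SERVER_MIN_BITS and rate_hz >= SERVER_MIN_RATE_HZ:
        engines.append("Microsoft.Speech.Recognition")
    return engines


def make_test_wav(bits, rate_hz, seconds=1):
    """Build an in-memory mono WAV of constant-valued samples, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (bits // 8) * rate_hz * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    bits, rate_hz, channels = describe_wav(make_test_wav(16, 22050))
    print(bits, rate_hz, channels, candidate_engines(bits, rate_hz))
```

Keep in mind that this only checks format. It says nothing about noise level, and the Server engine still won't do dictation however clean the file is.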
For a transcription scenario, I'd investigate the Windows.Media.SpeechRecognition API, and look at something like Virtual Audio Cable to create a fake default audio input device, so that audio played back from a file reaches the recognizer as if it came from a microphone.