C# Speech Recognition from System Audio (Speaker S

I've seen speech recognition from input devices (obviously) and I've seen speech recognition from files (http://gotspeech.net/forums/thread/6835.aspx). However, I was wondering whether it would be possible to run speech recognition on system audio in real time. By system audio, the sound that comes out of your speakers.

It would be a great tool for those who are hard of hearing, as they are watching YouTube videos, the C# Application could transcribe what's being said.

How could I go about doing this?

Very easily - Go to the sound mixer, choose input and enable/unmute "Stereo Mix". You should, of course, mute the mic if you don't want to record that too. Then, just start recording the same way you'd record the mic - now you'll get the same feed as the speakers at digital quality.

This can be done programatically although it can be fiddly - especially if you want to support WinXP as well as Vista/Win7 (Sound was overhauled in Vista and I believe the APIs are significantly different although I haven't had to use them yet).

You're almost certainly going to need to filter the sound before attempting recognition. Unless the speech recog. library you're using is designed to work in adverse conditions, music and special effects will interfere with proper recognition as will multiple people speaking at the same time.

If you haven't got a super-robust library, filters to attenuate non-vocal frequencies are going to be a must. You may also need to apply volume normalisation to account for loud/quiet scenes - There are hundreds of filters that could potentially improve matching.

You may want to access the recognition API at the lowest level to get as much control as possible - You'll need to tweak it to cope with people shouting, breathless, crying, etc... If you start designing for flexible low-level access, it will probably save you weeks if you find you need it later on and have to re-architect.

I'd suggest you look into NAudio as a starting point for audio processing

I suspect you'll be able to get something which works under ideal conditions without too much effort - but tweaking it to work well in all eventualities may be a mammoth task. That said, it sounds like a fun project.

You could improve recognition chance considerably by creating genre-, user- or show-specific dictionaries. These could either be pre-generated, or built automatically using a weighted feedback loop - perhaps also allowing the user to correct mistakes.