I need to assess the "quality" of a user's pronunciation using the Microsoft speech SDK (System.Speech.Recognition). I am using the MS Speech Engine - US, so what I actually need is to find out how close the speaker's voice is to a "North American" accent.
One way of doing this is to check how close the user's speech is to the US English phonetic pronunciation. According to MSDN, this comparison seems to be done inside the speech SDK itself, so I need to get that information out. Since we can also set the phonetic pronunciation on the engine ourselves, I am sure this is possible.
However, I have no clear idea of what I have to do. How can I measure the quality of the user's pronunciation, i.e. how close it is to US North American English phonetic pronunciation? The user will only have to speak pre-defined sentences like "Hello World. I am here".
Please help.
UPDATE
I got some kind of "phonemes" (as MSDN calls them) using the following code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Windows.Forms;
using System.IO;

namespace US_Speech_Recognizer
{
    public class RecognizeSpeech
    {
        private SpeechRecognitionEngine sEngine; // Speech recognition engine
        private SpeechSynthesizer sSpeak;        // Speech synthesizer
        string text3 = "";

        public RecognizeSpeech()
        {
            // Make the recognizer ready
            sEngine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

            // Load grammar
            Choices sentences = new Choices();
            sentences.Add(new string[] { "I am hungry" });
            GrammarBuilder gBuilder = new GrammarBuilder(sentences);
            Grammar g = new Grammar(gBuilder);
            sEngine.LoadGrammar(g);

            // Add a handler
            sEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sEngine_SpeechRecognized);

            sSpeak = new SpeechSynthesizer();
            sSpeak.Rate = -2;

            // Computer speaks the words to get the phones
            Stream stream = new MemoryStream();
            sSpeak.SetOutputToWaveStream(stream);
            sSpeak.Speak("I was hungry");
            stream.Position = 0;
            sSpeak.SetOutputToNull();

            // Configure the recognizer to read from the stream, then start recognition
            sEngine.SetInputToWaveStream(stream);
            sEngine.RecognizeAsync(RecognizeMode.Single);
        }

        // Handle the recognition result: collect the pronunciation of each recognized word
        private void sEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string text = "";
            if (e.Result.Text == "I am hungry")
            {
                foreach (RecognizedWordUnit wordUnit in e.Result.Words)
                {
                    text = text + wordUnit.Pronunciation + "\n";
                }
                MessageBox.Show(e.Result.Text + "\n" + text);
            }
        }
    }
}
This is the part of the code above that deals directly with the phonemes:
// Handle the recognition result: collect the pronunciation of each recognized word
private void sEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    string text = "";
    if (e.Result.Text == "I am hungry")
    {
        foreach (RecognizedWordUnit wordUnit in e.Result.Words)
        {
            text = text + wordUnit.Pronunciation + "\n";
        }
        MessageBox.Show(e.Result.Text + "\n" + text);
    }
}
The following is my output. The first line simply shows the recognized sentence; the phonemes I got are displayed from the second line onward.
According to MSDN, these are "phonemes". Are they really phonemes? I am asking because I have never seen them before.
The above code is based on this link: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.srgsgrammar.srgstoken.pronunciation(v=office.14).aspx
Ok, here's how I'd approach the problem.
First, load up the dictation engine with the Pronunciation topic, which will return the phonemes spoken by the user (in the Recognition event).
Second, get the reference phonemes for the word using the ISpEnginePronunciation::GetPronunciations method (as I outlined here).
Once you have the two sets of phonemes, you can compare them. Essentially, the phonemes are separated by spaces, and each phoneme is represented by a short tag (described in the American English Phoneme Representation spec).
Given this, you should be able to compute a score by comparing the two phoneme sequences using any of a number of approximate string-matching schemes (e.g., Levenshtein distance).
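As a rough illustration of the comparison step, here is a minimal sketch of a Levenshtein distance computed over whole phonemes (space-separated tokens) rather than single characters. The class and method names (PhonemeScore, Distance, Similarity) are made up for this example and are not part of System.Speech, and the normalization to a 0..1 score is just one possible choice.

```csharp
using System;

// Hypothetical helper for illustration; not part of System.Speech or SAPI.
public static class PhonemeScore
{
    // Levenshtein distance where one edit = one inserted, deleted or substituted phoneme.
    public static int Distance(string[] reference, string[] spoken)
    {
        int[,] d = new int[reference.Length + 1, spoken.Length + 1];

        // Transforming to or from an empty sequence costs one edit per phoneme.
        for (int i = 0; i <= reference.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= spoken.Length; j++) d[0, j] = j;

        for (int i = 1; i <= reference.Length; i++)
        {
            for (int j = 1; j <= spoken.Length; j++)
            {
                int cost = reference[i - 1] == spoken[j - 1] ? 0 : 1;
                int deletion = d[i - 1, j] + 1;
                int insertion = d[i, j - 1] + 1;
                int substitution = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }
        }
        return d[reference.Length, spoken.Length];
    }

    // Assumed normalization: 1.0 for identical sequences, 0.0 when every phoneme differs.
    public static double Similarity(string reference, string spoken)
    {
        char[] separator = { ' ' };
        string[] r = reference.Split(separator, StringSplitOptions.RemoveEmptyEntries);
        string[] s = spoken.Split(separator, StringSplitOptions.RemoveEmptyEntries);

        int longest = Math.Max(r.Length, s.Length);
        if (longest == 0) return 1.0; // two empty sequences are treated as identical
        return 1.0 - (double)Distance(r, s) / longest;
    }
}
```

With made-up phoneme strings, PhonemeScore.Similarity("h eh l ow", "h ah l ow") has one substitution out of four phonemes. Comparing tokens instead of characters means a multi-character tag such as "eh" counts as a single unit.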
You might find the problem simpler if you compare phone IDs rather than strings; ISpPhoneConverter::PhoneToId can convert the phoneme strings to an array of phone IDs, one ID per phoneme. That would give you a pair of null-terminated integer arrays, perhaps better suited to your comparison algorithm.
You could use the engine confidence to penalize matches, as low engine confidence indicates that the incoming audio doesn't closely match the engine's idea of the phoneme.
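One simple way to apply such a penalty, sketched below, is to multiply a phoneme similarity score by the engine confidence. This is only an assumption about how the two might be combined: the class and method names are hypothetical, the similarity is assumed to already be in the range 0..1, and the confidence is assumed to come from something like RecognitionResult.Confidence or RecognizedWordUnit.Confidence in System.Speech.

```csharp
using System;

// Hypothetical helper for illustration; the multiplicative weighting is an assumption,
// not something prescribed by the speech SDK.
public static class ConfidenceWeighting
{
    // Combines a phoneme similarity (0..1) with an engine confidence (0..1).
    // Low confidence pulls the final score down even when the phonemes match.
    public static double Score(double similarity, double confidence)
    {
        // Clamp both inputs so out-of-range values cannot inflate the result.
        double s = Math.Max(0.0, Math.Min(1.0, similarity));
        double c = Math.Max(0.0, Math.Min(1.0, confidence));
        return s * c;
    }
}
```

Other combinations (for example, subtracting a fixed penalty below some confidence threshold) would work as well; which one suits the application is best decided by testing against real recordings.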