I want to build a new acoustic model, a new dictionary, and a new language model for Sinhala speech recognition.
Sinhala characters are Unicode based, for example A=අ, I=ඉ, U=උ, KA=ක, BA=බ.
I went through the CMUSphinx Tutorial for Developers, but it did not help me: it only works for the English language.
The language model should be an ARPA model. How can I map Sinhala Unicode to English phonemes, and how do I train the language model with different voices?
Is there any tool available for generating a Unicode-based language model?
Overall, it is not really complex. First, split the task into parts: build a phonetic dictionary, build a language model, build an acoustic model. Start with the phonetic dictionary.
You need to write a Python script that maps Unicode input to a transliteration:
රට r a tt a
එකඟයි e k a ng a yi
අවසර දිම a v a s a r a d i m a
Basically, for every Sinhala letter you write a corresponding transliteration. That is all you need to do; later you can just feed the list of words into your script and get a dictionary in CMUSphinx format. This part is covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialdict
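To make the mapping idea concrete, here is a minimal sketch of such a script. The letter tables cover only the handful of letters used in the examples above, and the phone symbols, the function names and the handling of the inherent "a" vowel are my assumptions, not an official Sinhala phone set. Note that it emits the consonant and the vowel as separate phones, so it prints `y i` where the example above writes `yi`; adjust the tables to whatever phone set you settle on.

```python
# Illustrative subset only: extend these tables to the full Sinhala alphabet.
# The phone symbols on the right are assumptions, not a standard phone set.
INDEPENDENT_VOWELS = {"අ": "a", "ඉ": "i", "උ": "u", "එ": "e"}
CONSONANTS = {
    "ක": "k", "ඟ": "ng", "ට": "tt", "ද": "d", "බ": "b",
    "ම": "m", "ය": "y", "ර": "r", "ව": "v", "ස": "s",
}
# Dependent vowel signs replace the inherent "a" of the preceding consonant.
VOWEL_SIGNS = {"\u0dd2": "i", "\u0dd4": "u", "\u0dd9": "e"}
# The virama (al-lakuna) removes the inherent vowel altogether.
VIRAMA = "\u0dca"


def transliterate(word):
    """Return the list of phones for one Sinhala word."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
            i += 1
        elif ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 2                      # bare consonant, no vowel
            elif nxt in VOWEL_SIGNS:
                phones.append(VOWEL_SIGNS[nxt])
                i += 2                      # consonant + explicit vowel sign
            else:
                phones.append("a")          # inherent vowel
                i += 1
        else:
            raise ValueError("no mapping for character %r in %r" % (ch, word))
    return phones


def dictionary_line(word):
    """Format one dictionary entry: the word followed by its phones."""
    return word + " " + " ".join(transliterate(word))


if __name__ == "__main__":
    for w in ["රට", "අවසර"]:
        print(dictionary_line(w))
```

Raising an error on unmapped characters is deliberate: it shows you which letters are still missing from the tables instead of silently producing a broken dictionary.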
Once you have a transliteration tool, you can proceed with the language model. You need a lot of text to build a language model. You can download text from Wikipedia or from a local newspaper. Then you can use any language model toolkit to create an ARPA model. All of them support Unicode: SRILM, MITLM, IRSTLM, so you can use any of them. This part is covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
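Downloaded text usually needs cleaning before a toolkit sees it. A rough sketch of that step, assuming you want one sentence per line with everything outside the Sinhala Unicode block removed (the function names and the sentence-splitting rule are my own simplifications):

```python
import re
from collections import Counter

# Anything that is not in the Sinhala block (U+0D80..U+0DFF) or the
# zero-width joiner (U+200D, used inside Sinhala conjuncts) becomes a space.
NON_SINHALA = re.compile(r"[^\u0d80-\u0dff\u200d]+")
# Very naive sentence boundary: full stop, question mark, exclamation, newline.
SENTENCE_END = re.compile(r"[.?!\n]+")


def clean_corpus(raw_text):
    """Return a list of cleaned sentences, words separated by single spaces."""
    sentences = []
    for chunk in SENTENCE_END.split(raw_text):
        words = NON_SINHALA.sub(" ", chunk).split()
        if words:
            sentences.append(" ".join(words))
    return sentences


def vocabulary(sentences):
    """Count words so the word list can be fed to the dictionary script."""
    return Counter(w for s in sentences for w in s.split())


if __name__ == "__main__":
    sents = clean_corpus("රට රට. abc 123 අවසර!")
    print("\n".join(sents))
    print(vocabulary(sents).most_common())
```

Write the sentences to a UTF-8 text file and give that file to the toolkit you picked; the word list from `vocabulary` is what you run through the transliteration script to get the dictionary.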
The third step is to create an acoustic model. You need to record audio or segment existing recordings and start training. This part is also covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialam
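For training you also have to tell the trainer which recording goes with which text. As far as I remember from the tutorial, that is a file list plus a transcription file with lines like `<s> text </s> (file_id)`; check the exact layout against the tutorial before relying on it. A small sketch that builds both from a list of recordings (the function name and the `speaker_1/file_1` style ids are hypothetical):

```python
import posixpath


def training_lists(utterances):
    """Build the contents of the file list and the transcription file.

    `utterances` is a list of (file_id, text) pairs, where file_id is the
    path of the recording without extension, e.g. "speaker_1/file_1".
    """
    fileids = []
    transcripts = []
    for file_id, text in utterances:
        fileids.append(file_id)
        # Collapse whitespace and tag the line with the bare file name.
        words = " ".join(text.split())
        name = posixpath.basename(file_id)
        transcripts.append("<s> %s </s> (%s)" % (words, name))
    return "\n".join(fileids) + "\n", "\n".join(transcripts) + "\n"


if __name__ == "__main__":
    ids, trans = training_lists([
        ("speaker_1/file_1", "රට අවසර"),
        ("speaker_2/file_2", "අවසර"),
    ])
    print(ids)
    print(trans)
```

Every word that appears in the transcriptions must also be in the phonetic dictionary, so it is worth generating the dictionary from these same texts.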