best practice for specifying pronunciation for And

2019-01-13 21:40发布

问题:

In general, I'm very impressed with Android's default text to speech engine (i.e., com.svox.pico). As expected, it mispronounces some words (as do I) and it therefore occasionally needs some pronunciation guidance. So I'm wondering about best practices for phonetically spelling out those words that the pico TTS engine mispronounces.

For example, the correct pronunciation of the bird Chachalaca is CHAH-chah-LAH-kah. Here is what the TTS engine produces:

mTts.speak("Chachalaca", TextToSpeech.QUEUE_ADD, null); // output: chuh-KAL-uh-KUH
mTts.speak("CHAH-chah-LAH-kah", TextToSpeech.QUEUE_ADD, null); // output: CHAH-chah-EL-AY-AYCH-dash-kuh
mTts.speak("CHAHchahLAHkah", TextToSpeech.QUEUE_ADD, null); // output: CHA-chah-LAH-ka
mTts.speak("CHAH chah LOCKah", TextToSpeech.QUEUE_ADD, null); // output: CHAH-chah-LAH-kah

Here are my questions.

  • Is there a standard phonetic spelling recognized by the Android TTS engine?
  • If not, are there some general rules for making custom pronunciation spellings that will make the spellings more likely to be correct in future TTS engines/versions?
  • It appears that the Android TTS engine ignores text case. What is the best way to specify emphasis?

By the way, this is what the TTS engine writes to logcat:

V/TtsService( 294): TTS processing: CHAH chah LOCKah
V/TtsService( 294): TtsService.setLanguage(eng, USA, )
I/SVOX Pico Engine( 294): Language already loaded (en-US == en-US)
I/SynthProxy( 294): setting speech rate to 100
I/SynthProxy( 294): setting pitch to 100

[UPDATE]

I tried passing an XML document to TextToSpeech.speak() as follows:

            String text = "<?xml version=\"1.0\"?>" +
                "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" " +
                    "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " +
                    "xsi:schemaLocation=\"http://www.w3.org/2001/10/synthesis " +
                        "http://www.w3.org/TR/speech-synthesis/synthesis.xsd\" " +
                    "xml:lang=\"en-US\">" +

                    "That is a big car! " +
                    "That <emphasis>is</emphasis> a big car! " +
                    "That is a <emphasis>big</emphasis> car! " +
                    "That is a huge bank account! " +
                    "That <emphasis level=\"strong\">is</emphasis> a huge bank account! " +
                    "That is a <emphasis level=\"strong\">huge</emphasis> bank account!" +
                "</speak>";
            mTts.speak(text, TextToSpeech.QUEUE_ADD, null);

As Android Eve suggested, the TTS engine read only the XML body (i.e., the comments about the big car and the huge bank account). I didn't realize the TTS engine was capable of parsing XML documents. However, I did not hear any emphasis in the TTS output.

[UPDATE 2]

I simplified the question to whether or not Android TTS supports Speech Synthesis Markup Language here.

回答1:

JW answered my question at the tts-for-android group:

Hi Greg,

The Pico engine recognizes the tag with the XSAMPA alphabet.

There are no easy rules to derive a certain pronunciation from the orthograpy, but you can use intuitive spellings and trial and error. Capitalizing and hyphens will introduce more problems than solving them. Using different spellings and introducing extra word boundaries (spaces) can work.

The emphasis tag and the exclamation mark will not change the synthesis result. Use , , and commands instead.


Some examples of the proper syntax for specifying the pronunciation using the SSML phoneme tag are in these tests of TextToSpeech.

Even with these simple test SSML documents, there are warning messages posted to logcat about the SSML document not being well-formed. So I opened an issue about these seemingly incorrect logcat messages to the Android issue tracker.


The syntax for specifying an x-SAMPA sequence to SVOX pico is

String text = "<speak xml:lang=\"en-US\"> <phoneme alphabet=\"xsampa\" ph=\"d_ZIn\"/>.</speak>";
mTts.speak(text, TextToSpeech.QUEUE_ADD, null); 

Although more examples would be helpful, a good reference for x-SAMPA is at http://en.wikipedia.org/wiki/Xsampa If I compile a couple dozen examples, I'll post them to that Wikipedia page.



回答2:

One answer for all 3 questions: Look at the SSML specifications: http://www.w3.org/TR/speech-synthesis/

For example, to specify emphasis, you use the emphasis element, e.g.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>