Why does the Synopse hyphenation code give differe

Posted 2019-02-18 11:58

This question follows on from a previous question but is a different one. Synopse's Delphi hyphenation code is very fast and is built on OpenOffice's libhnj library, which uses TeX hyphenation.

A simple test is :

If I input 'pronunciation', the Synopse hyphenation outputs 'pro=nun=ci=ation' (3 possible hyphens, i.e. 4 syllables), not 'pro=nun=ci=a=tion' (4 possible hyphens, i.e. 5 syllables).

I read 2 papers (here and here) about using the TeX hyphenation algorithm for syllabification. The authors reported about 95% accuracy in syllabification. I tested the Synopse hyphenation only for counting syllables against the CMU Pronouncing Dictionary, but got only about 53% accuracy.

Why is the result significantly different?

I describe my method in a little more detail below.

I parse the CMU Pronouncing Dictionary to compute the number of syllables of every word. The CMU dictionary looks like this:

PRONOUNS  P R OW1 N AW0 N Z
PRONOVOST  P R OW0 N OW1 V OW0 S T
PRONTO  P R AA1 N T OW0
PRONUNCIATION  P R OW0 N AH2 N S IY0 EY1 SH AH0 N
PRONUNCIATION(1)  P R AH0 N AH2 N S IY0 EY1 SH AH0 N

I will have this result:

PRONOUNS=2
PRONOVOST=3
PRONTO=2
PRONUNCIATION(1)=5 // will be ignored
PRONUNCIATION=5   // use this one

Words with parentheses will be ignored when compared with the Synopse hyphenation lib. They are alternative or secondary pronunciations (variants).
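A minimal sketch of this counting step, assuming the standard CMU layout shown above: in that format each vowel phoneme carries a stress digit (0, 1 or 2), so the number of phonemes ending in a digit is the syllable count. The function name `cmu_syllable_counts` and the filter for "lines without meaningful words" (here: any entry that does not start with a letter) are assumptions for illustration, not the code actually used in the test.

```python
def cmu_syllable_counts(lines):
    """Map each primary CMU entry to its syllable count.

    Vowel phonemes end in a stress digit (0, 1 or 2), so counting
    phonemes that end in a digit counts syllables.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and ';;;' comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        # Skip variant pronunciations such as PRONUNCIATION(1) and
        # entries that do not start with a letter (assumed filter).
        if "(" in word or not word[0].isalpha():
            continue
        counts[word] = sum(1 for p in phonemes if p[-1].isdigit())
    return counts


sample = [
    "PRONOUNS  P R OW1 N AW0 N Z",
    "PRONOVOST  P R OW0 N OW1 V OW0 S T",
    "PRONTO  P R AA1 N T OW0",
    "PRONUNCIATION  P R OW0 N AH2 N S IY0 EY1 SH AH0 N",
    "PRONUNCIATION(1)  P R AH0 N AH2 N S IY0 EY1 SH AH0 N",
]
for word, n in cmu_syllable_counts(sample).items():
    print(f"{word}={n}")
```

On the five sample lines this reproduces the counts listed above, with the `(1)` variant dropped.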

Similarly, I will use the hyphenation lib to compute the number of syllables of each word in the CMU dictionary. Then I compare the two to see how many match. The words with different numbers of syllables are recorded like below:

...

94814 cmu PROMULGATED=4 | PROMULGATED=3 Synopse Hyphenation
94821 cmu PRONGER=2 | PRONGER=1 Synopse Hyphenation
94829 cmu PRONOUNCES=3 | PRONOUNCES=2 Synopse Hyphenation
94833 cmu PRONTO=2 | PRONTO=1 Synopse Hyphenation
94835 cmu PRONUNCIATION=5 | PRONUNCIATION=4 Synopse Hyphenation

...
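The hyphenation side of the comparison can be sketched as follows, assuming the hyphenator returns a string with `=` at every permitted break point, as in the 'pro=nun=ci=ation' example. The helper name `syllables_from_hyphenation` is hypothetical; it is not part of the Synopse library.

```python
def syllables_from_hyphenation(hyphenated, sep="="):
    """Count the pieces between break points in a hyphenated word.

    A word with no break point counts as a single piece, which is how
    a line such as 'PRONTO=1' arises in the mismatch list.
    """
    return len([part for part in hyphenated.split(sep) if part])


print(syllables_from_hyphenation("pro=nun=ci=ation"))   # 4
print(syllables_from_hyphenation("pro=nun=ci=a=tion"))  # 5
print(syllables_from_hyphenation("pronto"))             # 1
```

Counting this way treats every missing break point as a missing syllable, so any word where the patterns allow fewer breaks than there are syllable boundaries is recorded as a mismatch.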

The CMU dictionary has 123611 lines in total (excluding lines with parentheses and lines without meaningful words, like quotation mark lines '('). The number of words for which the two syllable counts differ is 57870.

CMU may not be the standard for syllable counts. In this test, the agreement is (123611-57870)/123611 ≈ 53.18%. This is significantly different from the accuracy reported by the authors in their paper above. Of course, they used another database (CELEX) for their tests. Why is the result so different?
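The agreement figure can be checked with a short sketch like the one below. The function name `agreement_rate` and the two small dictionaries are illustrative assumptions; only the totals 123611 and 57870 come from the test described here.

```python
def agreement_rate(reference, predicted):
    """Return (rate, mismatches) for words present in both mappings.

    rate is the fraction of shared words whose syllable counts agree;
    mismatches lists (word, reference_count, predicted_count).
    """
    shared = [w for w in reference if w in predicted]
    mismatches = [
        (w, reference[w], predicted[w])
        for w in shared
        if reference[w] != predicted[w]
    ]
    rate = (len(shared) - len(mismatches)) / len(shared) if shared else 0.0
    return rate, mismatches


# Toy data in the shape of the mismatch list (values taken from the post).
cmu = {"PRONOUNS": 2, "PRONTO": 2, "PRONUNCIATION": 5}
hyph = {"PRONOUNS": 2, "PRONTO": 1, "PRONUNCIATION": 4}
rate, diffs = agreement_rate(cmu, hyph)
print(rate, diffs)

# The reported totals: 123611 words compared, 57870 of them differing.
total, differing = 123611, 57870
print(f"{(total - differing) / total:.2%}")
```

Note that this measures agreement on syllable counts only; it says nothing about whether the break points themselves fall on the right boundaries.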

The Synopse hyphenation library is very fast. I want to know whether the difference is due to the pattern file (the dic file used for hyphenation, originally from libhnj as used in OpenOffice). Or did the authors of the paper use a different dictionary file?

1 answer
闹够了就滚
#2 · 2019-02-18 12:42

In short, I believe the difference in accuracy between what was reported in our SPIRE 2009 paper and the results reported here is so great because we trained the method ourselves instead of using patterns generated through prior training (which, from what I can gather, is what you are doing here).

How we performed training to obtain our patterns is described briefly on the third page of our paper (pg.176) and in more detail in Section 4.3 of my thesis which you can find here: http://web.cs.dal.ca/~adsett/Adsett_SyllAlgs_2008.pdf
