Detecting syllables in a word

2019-01-03 00:59发布

I need to find a fairly efficient way to detect syllables in a word. E.g.,

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. E.g.,

Pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I've tried few methods, among which were using regex (which helps only if you want to count syllables) or hard coded rule definition (a brute force approach which proves to be very inefficient) and finally using a finite state automata (which did not result with anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.

I would appreciate if one could give me tips on an alternate way to solve this problem besides my previous approaches.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

15条回答
混吃等死
2楼-- · 2019-01-03 01:10

I used jsoup to do this once. Here's a sample syllable parser:

public String[] syllables(String text){
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try{
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if(link == null){return new String[]{text};}
            relHref = link.html(); 
        }catch(IOException e){
            relHref = text;
        }
        String[] syl = relHref.split("·");
        return syl;
    }
查看更多
Emotional °昔
3楼-- · 2019-01-03 01:11

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and luajit 2 (most likely will run on other versions of lua as well):

countsyllables.lua

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm it works (as much as it's supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")
查看更多
贼婆χ
4楼-- · 2019-01-03 01:14

Here is a solution using NLTK:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] 
查看更多
欢心
5楼-- · 2019-01-03 01:15

Today I found this Java implementation of Frank Liang's hyphenation algorithmn with pattern for English or German, which works quite well and is available on Maven Central.

Cave: It is important to remove the last lines of the .tex pattern files, because otherwise those files can not be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex files containing the needed patterns. Those files are available on the project github site.

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD", since the API does not return a normal "-".

This approach outperforms the answer of Joe Basirico, since it supports many different languages and detects German hyphenation more accurate.

查看更多
forever°为你锁心
6楼-- · 2019-01-03 01:17

Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.

Here is your code in Java, along with test cases:

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually? silent
    if (word.length() > 2 && 
            word.substring(word.length() - 2) == "es")
        numVowels--;
    // remove silent e
    else if (word.length() > 1 &&
            word.substring(word.length() - 1) == "e")
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works good enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2
查看更多
我命由我不由天
7楼-- · 2019-01-03 01:17

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://stackoverflow.com/a/32784041/2734752

I use a combination of a dictionary and algorithm method to count syllables.

You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker

I just tested my algorithm and had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3
查看更多
登录 后发表回答