I need to find a fairly efficient way to detect syllables in a word. E.g.,
Invisible -> in-vi-sib-le
There are some syllabification rules that could be used:
V CV VC CVC CCV CCCV CVCC
*where V is a vowel and C is a consonant. E.g.,
Pronunciation (5 syllables: Pro-nun-ci-a-tion; CCV-CVC-CV-V-CVVC)
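To make the V/C notation concrete, here is a minimal sketch that maps already-split syllables to their shapes. The function name `cv_pattern` is made up for illustration, and it assumes only a, e, i, o, u count as vowels (so "y" and any non-letter character are labelled C), which is a simplification.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant

def cv_pattern(syllables):
    """Map each syllable to its C/V shape, e.g. 'pro' -> 'CCV'."""
    return "-".join(
        # Every character that is not in VOWELS is labelled 'C'.
        "".join("V" if ch in VOWELS else "C" for ch in syl.lower())
        for syl in syllables
    )

print(cv_pattern(["in", "vi", "sib", "le"]))  # VC-CV-CVC-CV
```

This only labels syllables that have already been separated; it does not find the boundaries itself.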
I've tried a few methods: regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).
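For reference, the regex idea roughly amounts to counting runs of vowel letters. The following is a sketch under that assumption, not the exact expression I used; `count_syllables` is an illustrative name and "y" is treated as a vowel here.

```python
import re

# One or more consecutive vowel letters count as a single syllable nucleus.
VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)

def count_syllables(word):
    """Rough estimate: one syllable per run of consecutive vowel letters."""
    return max(1, len(VOWEL_GROUP.findall(word)))

print(count_syllables("invisible"))      # 4
print(count_syllables("pronunciation"))  # 4, although the word has 5 syllables
```

It gives a count but no split positions, and it miscounts words where adjacent vowels belong to different syllables ("ci-a" in "pronunciation").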
The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.
I would appreciate it if someone could suggest an alternative way to solve this problem, besides the approaches I have already tried.
I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.
I used jsoup to do this once. Here's a sample syllable parser:
Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):
countsyllables.lua
And some fun tests to confirm it works (as much as it's supposed to):
countsyllables.tests.lua
Here is a solution using NLTK:
Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.
Caveat: It is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.
To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex files containing the needed patterns. Those files are available on the project's GitHub site.
Afterwards, the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.
You need to split on "\u00AD", since the API does not return a normal "-".
This approach outperforms the answer of Joe Basirico, since it supports many different languages and detects German hyphenation more accurately.
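The splitting step is language-independent, so here is a minimal Python sketch of it. The sample string is a hypothetical hyphenator output with soft hyphens inserted, not actual output from the Java library, and `split_syllables` is a name chosen for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderers

def split_syllables(hyphenated):
    """Split a hyphenated word at soft hyphens, dropping empty pieces."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]

sample = "in\u00advis\u00adi\u00adble"  # hypothetical hyphenator output
print(split_syllables(sample))  # ['in', 'vis', 'i', 'ble']
```

Splitting on an ordinary "-" would leave the word in one piece, because the soft hyphen is a different code point.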
Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.
Here is your code in Java, along with test cases:
The result was as expected (it works well enough for Flesch-Kincaid):
I could not find an adequate way to count syllables, so I designed a method myself.
You can view my method here: https://stackoverflow.com/a/32784041/2734752
I use a combination of a dictionary and algorithm method to count syllables.
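As a rough illustration of the general idea (this is not the code from the linked library), a lookup table can handle known words while a vowel-group heuristic covers everything else. The tiny `KNOWN_COUNTS` table and the function name are hypothetical.

```python
import re

# Hypothetical exception dictionary; a real one would be far larger.
KNOWN_COUNTS = {"pronunciation": 5, "queue": 1}

def count_syllables(word):
    """Use the dictionary when the word is known, else fall back to a heuristic."""
    key = word.lower()
    if key in KNOWN_COUNTS:
        return KNOWN_COUNTS[key]
    # Fallback: one syllable per run of consecutive vowel letters.
    return max(1, len(re.findall(r"[aeiouy]+", key)))

print(count_syllables("Pronunciation"))  # 5, from the dictionary
print(count_syllables("invisible"))      # 4, from the fallback heuristic
```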
You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker
I just tested my algorithm and had a 99.4% strike rate!
Output: