I am looking for an algorithm that can map characters with diacritics (tilde, circumflex, caret, umlaut, caron) to their "simple" base character.
For example:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n
á --> a
ä --> a
ấ --> a
ṏ --> o
Etc.
I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.
Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players, and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone enters Bjorn and not Björn.
The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).
Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.
These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.
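A minimal sketch of the Collator approach, assuming an English locale. The class and method names (AccentInsensitiveKeys, primaryCollator, sameBaseLetters, indexKey) are made up for illustration; only Collator and CollationKey come from the JDK.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class; names are illustrative, not from any library.
class AccentInsensitiveKeys {

    // Builds a collator that compares on PRIMARY differences only.
    // In the English rules, accents and case are weaker than primary,
    // so they are ignored at this strength.
    static Collator primaryCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    // True when the collator sees no primary difference between a and b.
    static boolean sameBaseLetters(Collator collator, String a, String b) {
        return collator.compare(a, b) == 0;
    }

    // Byte form of the collation key, suitable for storing in an index.
    static byte[] indexKey(Collator collator, String s) {
        CollationKey key = collator.getCollationKey(s);
        return key.toByteArray();
    }

    public static void main(String[] args) {
        Collator english = primaryCollator(Locale.ENGLISH);
        String accented = "Bj\u00F6rn"; // "Bjorn" with o-umlaut (U+00F6)
        System.out.println(sameBaseLetters(english, accented, "bjorn"));
        System.out.println(indexKey(english, accented).length + " key bytes");
    }
}
```

Keys produced by different collators (or different locales) are not comparable with each other, so an index built this way should stick to one collator configuration.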
Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".
Here's a discussion and implementation of diacritic marker removal using Perl.
These existing SO questions are related:
Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.
For instance, in German, when replacing the eszett (ß), some people might use "B", while others might use "ss". Likewise, an umlauted o might be replaced with either "o" or "oe". Ideally, I would think, any solution you come up with should include both.
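One way to keep both alternates is to index several search keys per word instead of a single "translation". The sketch below is only an illustration of that idea: the SearchVariants class and its methods are hypothetical, the replacement table covers just the German umlauts and ß, and it maps ß to "ss" only.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; the German replacement table is an assumption
// made for this example, not a complete transliteration scheme.
class SearchVariants {

    // Decomposes to NFD and drops combining marks (o-umlaut becomes "o").
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Spells out German umlauts and eszett (o-umlaut becomes "oe").
    static String expandGerman(String s) {
        // Compose first so that "o" + combining diaeresis is also caught.
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return composed
                .replace("\u00E4", "ae").replace("\u00C4", "Ae")
                .replace("\u00F6", "oe").replace("\u00D6", "Oe")
                .replace("\u00FC", "ue").replace("\u00DC", "Ue")
                .replace("\u00DF", "ss");
    }

    // Returns the original word plus both simplified spellings.
    static Set<String> variants(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        out.add(stripMarks(word));
        out.add(stripMarks(expandGerman(word)));
        return out;
    }

    public static void main(String[] args) {
        // "Bj\u00F6rn" is "Bjorn" written with o-umlaut.
        System.out.println(variants("Bj\u00F6rn"));
    }
}
```

Storing every variant as a lookup key means a search for either "Bjorn" or "Bjoern" can find the same record.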
For future reference, here is a C# extension method that removes accents.
It's part of Apache Commons Lang as of ver. 3.1.
returns
An
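The call itself did not survive in this answer; it presumably refers to StringUtils.stripAccents from Commons Lang. For readers who would rather not add a dependency, a similar result can be sketched with the JDK's java.text.Normalizer. This is not the Commons Lang implementation, and the AccentStripper class name is made up for the example.

```java
import java.text.Normalizer;

// Hypothetical helper; a JDK-only approximation of accent stripping.
class AccentStripper {

    // Decomposes each character into base letter + combining marks (NFD),
    // then removes everything in the Unicode "Mark" category.
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "A\u00F1" is "A" followed by n-with-tilde (U+00F1).
        System.out.println(strip("A\u00F1")); // prints "An"
    }
}
```

This only handles characters that have a canonical decomposition. Letters such as ɲ, ø or ł are separate code points with no decomposition and pass through unchanged, so they would need an explicit mapping table.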
Please note that not all of these marks are just "marks" on some "normal" character, that you can remove without changing the meaning.
In Swedish, å ä and ö are true and proper first-class characters, not some "variant" of some other character. They sound different from all other characters, they sort different, and they make words change meaning ("mätt" and "matt" are two different words).