Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ

Posted 2018-12-31 16:25

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

  1. I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.

  2. Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone enters Bjorn and not Björn.

12 answers
零度萤火
#2 · 2018-12-31 17:00

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

import java.text.CollationKey;
import java.text.Collator;
import java.util.Map;
import java.util.TreeMap;

// PRIMARY strength ignores accent and case differences.
Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
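The "convert it to a byte array" step could look roughly like the sketch below. It assumes an explicit Locale.ENGLISH collator (so the result does not depend on the machine's default locale); the class name CollationKeyBytes and the helper searchKey are made up for illustration and are not part of java.text.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeyBytes {
    // Hypothetical helper: returns the collation key of s as bytes,
    // e.g. for storing in a database column or another index.
    static byte[] searchKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: English rules; other locales may treat some letters as distinct.
        Collator english = Collator.getInstance(Locale.ENGLISH);
        english.setStrength(Collator.PRIMARY);

        byte[] stored = searchKey(english, "Bj\u00f6rn"); // "Björn"
        byte[] query = searchKey(english, "bjorn");

        // Keys of strings that are equal at PRIMARY strength compare equal byte for byte.
        System.out.println(Arrays.equals(stored, query));
    }
}
```

Keys are only comparable when they come from the same collator configuration, so the stored keys and the query keys should be built with the same locale and strength.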

若你有天会懂
#3 · 2018-12-31 17:15

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:

步步皆殇っ
#4 · 2018-12-31 17:15

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.

For instance, in German, when replacing the Eszett ("ß"), some people might use "B", while others might use "ss". Similarly, an "ö" might be replaced with either "o" or "oe". Ideally, I would think, any solution you come up with should include both alternatives.
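One way to keep several alternates per word is sketched below in Java. It is only an illustration under stated assumptions: the class SearchVariants and its methods are hypothetical names, only the German letters ä, ö, ü and ß are handled, and ß is always written as "ss" (the "B" spelling is not generated).

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

class SearchVariants {
    // Decompose (NFD) and drop all combining marks, so "ö" becomes "o".
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // German-style transliteration: "ö" becomes "oe", "ß" becomes "ss", and so on.
    static String germanTransliteration(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                .replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
                .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue")
                .replace("\u00df", "ss");
    }

    // All search spellings to store alongside the original word.
    static Set<String> variants(String s) {
        Set<String> out = new LinkedHashSet<>();
        // "ß" has no decomposition, so it is replaced explicitly here.
        out.add(stripMarks(s).replace("\u00df", "ss"));
        out.add(stripMarks(germanTransliteration(s)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("Bj\u00f6rn"));   // [Bjorn, Bjoern]
        System.out.println(variants("Stra\u00dfe"));  // [Strasse]
    }
}
```

Indexing every variant means a search for either "Bjorn" or "Bjoern" can find "Björn".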

人间绝色
#5 · 2018-12-31 17:18

For future reference, here is a C# extension method that removes accents.

using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Decomposes each character (FormD), then drops the combining marks.
    public static string RemoveDiacritics(this string str)
    {
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}

public static class Program
{
    static void Main()
    {
        var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
        var output = input.RemoveDiacritics();
        Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
    }
}
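Since the question asks for Java, the same idea (decompose, then drop non-spacing marks) might be sketched with java.text.Normalizer as below. The class and method names are illustrative only, and this is a rough counterpart to the C# method rather than a drop-in port.

```java
import java.text.Normalizer;

class RemoveDiacritics {
    // Canonically decomposes the input (NFD), then keeps every code point
    // that is not a non-spacing mark (Unicode category Mn).
    static String removeDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // "ŃŅŇ àáâ" written with escapes to avoid source-encoding issues.
        System.out.println(removeDiacritics("\u0143\u0145\u0147 \u00e0\u00e1\u00e2")); // NNN aaa
    }
}
```

Letters that have no canonical decomposition, such as "ł" (U+0142) or "ɲ", pass through unchanged with this approach and would need an explicit mapping table.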
萌妹纸的霸气范
#6 · 2018-12-31 17:19

It's part of Apache Commons Lang as of ver. 3.1.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

梦醉为红颜
#7 · 2018-12-31 17:20

Please note that not all of these marks are just "marks" on some "normal" character that you can remove without changing the meaning.

In Swedish, å, ä and ö are true and proper first-class characters, not some "variant" of some other character. They sound different from all other characters, they sort differently, and they make words change meaning ("mätt" and "matt" are two different words).
