Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.
Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone enters Bjorn and not Björn.

Tags: java unicode diacritics transliteration

12 answers

零度萤火

Reply #2 · 2018-12-31 17:00

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert each key to a byte array.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

import java.text.CollationKey;
import java.text.Collator;
import java.util.Map;
import java.util.TreeMap;

// PRIMARY strength ignores accent and case differences.
Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
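As a rough sketch of the byte-array route mentioned above: the example below pins the locale to Locale.ENGLISH (an assumption, chosen so the result does not depend on the machine's default locale) and uses a hypothetical helper, keyBytes, to produce the bytes you might store in an index column. Two strings that differ only by accents or case should then yield equal byte arrays.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyBytes {
    // Hypothetical helper: the PRIMARY-strength collation key of s as raw bytes.
    static byte[] keyBytes(String s) {
        // Assumption: English rules; pick the locale that matches your data.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        byte[] stored = keyBytes("Bj\u00f6rn"); // the name spelled with o-diaeresis
        byte[] query = keyBytes("bjorn");       // what a user might type
        // Equal arrays mean the two spellings match at PRIMARY strength.
        System.out.println(Arrays.equals(stored, query));
    }
}
```

Keys are only comparable when they come from collators with the same locale and strength, so the stored keys and the query keys have to be generated with the same settings.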


若你有天会懂

Reply #3 · 2018-12-31 17:15

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:


步步皆殇っ

Reply #4 · 2018-12-31 17:15

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.

For instance, in German, when replacing the "Eszett" (ß), some people might use "B", while others might use "ss". Likewise, an umlauted o might be replaced with "o" or with "oe". Ideally, any solution you come up with should include both.
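One way to keep several alternates per word, sketched in Java under some assumptions: the class SearchVariants and its methods are hypothetical names, and the replacement table only covers the common German digraph spellings (ae, oe, ue, ss). It does not attempt the "B" for ß substitution, and other languages would need their own tables.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

public class SearchVariants {
    // Drops combining marks after canonical decomposition (NFD).
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
    }

    // Assumption: German-style digraphs; compose first so the lookups match.
    static String expandGerman(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                .replace("\u00e4", "ae").replace("\u00f6", "oe")
                .replace("\u00fc", "ue").replace("\u00c4", "Ae")
                .replace("\u00d6", "Oe").replace("\u00dc", "Ue")
                .replace("\u00df", "ss");
    }

    // Returns the spellings worth indexing for one stored word.
    static Set<String> variants(String s) {
        Set<String> out = new LinkedHashSet<>();
        out.add(s);                                      // original spelling
        out.add(stripMarks(s).replace("\u00df", "ss")); // plain-letter form
        out.add(stripMarks(expandGerman(s)));            // digraph form
        return out;
    }

    public static void main(String[] args) {
        // Input is "Bjorn" spelled with o-diaeresis.
        System.out.println(variants("Bj\u00f6rn"));
    }
}
```

Indexing every variant lets a search for either "Bjorn" or "Bjoern" find the same record, at the cost of a few extra keys per entry.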


人间绝色

Reply #5 · 2018-12-31 17:18

For future reference, here is a C# extension method that removes accents.

using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Decomposes to NFD, then drops the combining (non-spacing) marks.
    public static string RemoveDiacritics(this string str)
    {
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}

public static class Program
{
    public static void Main()
    {
        var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
        var output = input.RemoveDiacritics();
        Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
    }
}
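Since the question asks for Java, the same decompose-then-filter idea can be sketched with java.text.Normalizer. The class and method names here (DiacriticStripper, stripDiacritics) are made up for illustration. Note that this only removes marks that Unicode treats as separate combining characters after decomposition: letters such as ɲ or ƞ from the question have no canonical decomposition and pass through unchanged.

```java
import java.text.Normalizer;

public class DiacriticStripper {
    // Hypothetical helper: decompose to NFD, then delete non-spacing marks (Mn).
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // Escapes keep the source ASCII-only: o-diaeresis, then n-tilde.
        System.out.println(stripDiacritics("Bj\u00f6rn Borg"));
        System.out.println(stripDiacritics("A\u00f1"));
    }
}
```

Characters without a decomposition would need an explicit mapping table (or a collation-based comparison) if they should also fold to a base letter.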


萌妹纸的霸气范

Reply #6 · 2018-12-31 17:19

It's part of Apache Commons Lang as of ver. 3.1.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An


梦醉为红颜

Reply #7 · 2018-12-31 17:20

Please note that not all of these marks are just "marks" on some "normal" character that you can remove without changing the meaning.

In Swedish, å, ä and ö are true and proper first-class characters, not some "variant" of some other character. They sound different from all other characters, they sort differently, and they make words change meaning ("mätt" and "matt" are two different words).

