Removing accent marks (diacritics) from Latin char

Posted 2019-03-29 06:15

This question already has an answer here:

Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars (12 answers)

I need to compare the names of European places written in the Latin alphabet, where some characters carry accent marks (diacritics). Many Central and Eastern European names use accented letters such as ž and ü, but some people write them with plain Latin letters instead, such as z and u.

I need a way for my system to recognize that, for example, mšk žilina is the same as msk zilina, and likewise for all the other accented characters in use. Is there a simple way to do this?

Tags: java string diacritics transliteration

1 answer

Root（大扎）

Reply #2 · 2019-03-29 06:35

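You can use java.text.Normalizer and a small regex to strip the diacritical marks. Normalizing to NFD splits each accented letter into its base letter followed by one or more combining marks, and the regex then removes those marks.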

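import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decompose to NFD, then drop the combining marks (block U+0300 to U+036F).
public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}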

Usage example:

String text = "mšk žilina";
String normalized = removeDiacriticalMarks(text);
System.out.println(normalized); // msk zilina
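
Since the question is about comparing names rather than printing them, here is a rough sketch of how the same idea could be wrapped into a comparison helper. The class name PlaceNames and the methods comparisonKey and sameName are made up for illustration, and the lowercasing and trimming are my own assumptions about what "the same name" should mean. The strings use Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class PlaceNames {

    // Builds a key meant for comparison only, not for display.
    static String comparisonKey(String name) {
        // NFD splits an accented letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Same regex as in the answer: drop the combining marks.
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Assumption: case and surrounding whitespace should not matter.
        return stripped.toLowerCase(Locale.ROOT).trim();
    }

    static boolean sameName(String a, String b) {
        return comparisonKey(a).equals(comparisonKey(b));
    }

    public static void main(String[] args) {
        // "msk zilina" written with carons on the s and the z
        String accented = "m\u0161k \u017Eilina";
        System.out.println(sameName(accented, "MSK Zilina")); // true

        // "Lodz" in Polish spelling: L with stroke, o acute, z acute
        String lodz = "\u0141\u00F3d\u017A";
        System.out.println(sameName(lodz, "Lodz")); // false
    }
}
```

The second call prints false because the L with stroke (U+0141) is a letter in its own right with no canonical decomposition, so NFD leaves it untouched. Letters like that would need an explicit replacement table on top of this approach if they matter for your data.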
