Converting Symbols, Accent Letters to English Alph

The problem is that, as you know, there are thousands of characters in the Unicode chart and I want to convert all the similar characters to the letters which are in English alphabet.

For instance here are a few conversions:

ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...

and I saw that there are more than 20 versions of letter A/a. and I don't know how to classify them. They look like needles in the haystack.

The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.

How can I convert all these with Java? Please help me :(

标签： java unicode special-characters diacritics

12条回答

几人难应

2楼-- · 2019-01-01 00:46

The original request has been answered already.

However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.

Naive meaning of tranliteration: Translated string in it's final form/target charset sounds like the string in it's original form. If we want to transliterate any charset to Latin(English alphabets), then ICU4(ICU4J library in java ) will do the job.

Here is the code snippet in java:

    import com.ibm.icu.text.Transliterator; //ICU4J library import

    public static String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    public static String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
    * Returns the transliterated string to convert any charset to latin.
    */
    public static String transliterate(String input) {
        Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        String result = transliterator.transliterate(input);
        return result;
    }

0人赞添加讨论(0) 举报

孤独总比滥情好

3楼-- · 2019-01-01 00:48

The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.

Add to that the fact that Unicode has multiple code points for the same glyphs.

The upshot is that the only way to do this is create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".

Here is a tiny excerpt from an app that does this:

switch (c)
{
    case 'A':
    case '\u00C0':  //  À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1':  //  Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2':  //  Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";
        break;

    case '\u00C6'://  Æ LATIN CAPITAL LIGATURE AE
        return "AE";
        break;

    // And so on for pages...
}

0人赞添加讨论(0) 举报

不流泪的眼

4楼-- · 2019-01-01 00:49

Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in java (purely for the purpose of removing diacritical marks aka accents).

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

0人赞添加讨论(0) 举报

春风洒进眼中

5楼-- · 2019-01-01 00:52

There is no easy or general way to do what you want because it is just your subjective opinion that these letters look loke the latin letters you want to convert to. They are actually separate letters with their own distinct names and sounds which just happen to superficially look like a latin letter.

If you want that conversion, you have to create your own translation table based on what latin letters you think the non-latin letters should be converted to.

(If you only want to remove diacritial marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However you describe a more general problem)

0人赞添加讨论(0) 举报

宁负流年不负卿

6楼-- · 2019-01-01 00:54

It's a part of Apache Commons Lang as of ver. 3.0.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/

0人赞添加讨论(0) 举报

谁念西风独自凉

7楼-- · 2019-01-01 00:55

You could try using unidecode, which is available as a ruby gem and as a perl module on cpan. Essentially, it works as a huge lookup table, where each unicode code point relates to an ascii character or string.

0人赞添加讨论(0) 举报

1 2 下一页

Converting Symbols, Accent Letters to English Alph

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间