How do I convert unicode codepoints to their chara

Posted 2020-05-31 06:02

Question:

How do I convert strings representing code points to the appropriate character?

For example, I want to have a function which gets U+00E4 and returns ä.

I know that the Character class has a method toChars(int codePoint) which takes an integer, but there is no method which takes a string of this form.

Is there a built-in function, or do I have to transform the string myself to get the integer which I can pass to that method?

Answer 1:

Code points are written as hexadecimal numbers prefixed by "U+".

So you can strip the prefix and parse the rest as a base-16 integer:

int codepoint=Integer.parseInt(yourString.substring(2),16);
char[] ch=Character.toChars(codepoint);
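
To get the function the question asks for, the two lines above can be wrapped in a small helper. This is only a sketch: the class and method names (CodePointParser, fromUPlus) are made up for illustration, and it assumes the input is a single code point written as "U+" followed by hex digits.

```java
// Hypothetical helper, not part of the JDK: wraps the two lines above in a method.
final class CodePointParser {

    // Converts a string such as "U+00E4" to the character(s) it denotes.
    static String fromUPlus(String notation) {
        // Assumption: the input must start with "U+" (case-insensitive).
        if (notation == null || !notation.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + notation);
        }
        // Parse the hex digits after the prefix; throws NumberFormatException on bad digits.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        // toChars returns one char for BMP code points and a surrogate pair otherwise;
        // it throws IllegalArgumentException if the value is not a valid code point.
        return new String(Character.toChars(codePoint));
    }
}
```

With this, CodePointParser.fromUPlus("U+00E4") returns "ä", and a supplementary code point such as "U+1F600" comes back as a two-char String.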


Answer 2:

"\u00E4"

new String(new int[] { 0x00E4 }, 0, 1);
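
The String(int[] codePoints, int offset, int count) constructor used above also accepts code points outside the BMP, which a "\uXXXX" literal cannot express in one escape. A small sketch, with a made-up helper name (fromCodePoints), to show that:

```java
// Hypothetical helper: builds a String from any number of code points
// using the String(int[], int, int) constructor shown above.
final class CodePointArrayDemo {

    static String fromCodePoints(int... codePoints) {
        // Throws IllegalArgumentException if any element is not a valid code point.
        return new String(codePoints, 0, codePoints.length);
    }
}
```

For example, fromCodePoints(0x48, 0x1F600) gives a String whose length() is 3 (one char for "H" plus a surrogate pair), while its codePointCount(0, 3) is 2.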


Answer 3:

Converted from Kotlin:

    public String codepointToString(int cp) {
        StringBuilder sb = new StringBuilder();
        if (Character.isBmpCodePoint(cp)) {
            sb.append((char) cp);
        } else if (Character.isValidCodePoint(cp)) {
            sb.append(Character.highSurrogate(cp));
            sb.append(Character.lowSurrogate(cp));
        } else {
            sb.append('?');
        }
        return sb.toString();
    }
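
As far as I can tell, the manual surrogate split can also be left to StringBuilder.appendCodePoint, which does the same thing internally. A sketch that keeps the '?' fallback for invalid input (the class name CodePointStrings is just for illustration):

```java
// Sketch of an alternative: let StringBuilder.appendCodePoint handle
// the BMP / surrogate-pair distinction.
final class CodePointStrings {

    static String codePointToString(int cp) {
        // appendCodePoint would throw for invalid values, so check first
        // and keep the '?' fallback used in the answer above.
        if (!Character.isValidCodePoint(cp)) {
            return "?";
        }
        return new StringBuilder().appendCodePoint(cp).toString();
    }
}
```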


Answer 4:

This example does not use char[].

// This code is Kotlin, but you can write the same thing in Java
val sb = StringBuilder()
val cp: Int = 0x00E4 // code point to convert (example value)
when {
    Character.isBmpCodePoint(cp) -> sb.append(cp.toChar())
    Character.isValidCodePoint(cp) -> {
        sb.append(Character.highSurrogate(cp))
        sb.append(Character.lowSurrogate(cp))
    }
    else -> sb.append('?')
}


Answer 5:

The question asked for a function to convert a string value representing a Unicode code point (i.e. "U+nnnn" rather than the Java formats of "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode format:

  • The introduction of Streams in Java 8.
  • The method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".

Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:

void processUnicode() {

    // Create a test string containing "Hello World               
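
The listing above is cut off in this copy. What follows is not the original code, only a minimal sketch of the approach the answer describes, assuming Java 11+ and an input of whitespace-separated "U+nnnn" tokens. The names UnicodeNotationDecoder and decode are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Sketch only: decodes a string such as "U+0048 U+0065" in one stream pipeline.
final class UnicodeNotationDecoder {

    static String decode(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())                       // skip the empty token from blank input
                .mapToInt(token -> Integer.parseInt(token.substring(2), 16)) // drop "U+", parse hex
                .mapToObj(cp -> Character.toString(cp))                  // Character.toString(int), Java 11+
                .collect(Collectors.joining());
    }
}
```

For example, decode("U+0048 U+0065 U+006C U+006C U+006F") returns "Hello". Malformed tokens are not handled gracefully here; they surface as exceptions from substring or parseInt.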
                            
Tags: java unicode