If I use Java 8's String.codePoints to get an

Posted 2020-03-08 07:20

Question:

Given a String string in Java, does string.codePoints().toArray().length reflect the length of the String in terms of the actual characters that a human would find meaningful? In other words, does it smooth over escape characters and other artifacts of encoding?

Edit: By "human" I kind of meant "programmer", as I would imagine most programmers would see \r\n as two characters, ESC as one character, etc. But now I see that even accent marks get atomized, so it doesn't matter.

Answer 1:

No.

For example:

  • Control characters (such as ESC, CR, NL, etcetera) will not be removed. These have distinct codepoints in Unicode.

  • Sequences of spaces, tabs, etc. are not combined.

  • Discretionary hyphen (http://www.fileformat.info/info/unicode/char/00AD/index.htm) characters are not removed.

  • Unicode combining characters (https://en.wikipedia.org/wiki/Combining_character) are not combined.


Now it is debatable whether some of these might be "actual characters that a human would find meaningful" ... but the overall answer is still No.
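As a rough illustration of the points above, here is a minimal sketch that counts code points for a few sample strings. The class name CodePointCountDemo, the helper codePointLength and the sample strings are made up for this example; only String.codePoints() and IntStream.toArray() come from the question.

```java
class CodePointCountDemo {

    // Number of Unicode code points in s, computed the same way as in the question.
    static int codePointLength(String s) {
        return s.codePoints().toArray().length;
    }

    public static void main(String[] args) {
        // CR followed by LF: two control characters, two code points.
        System.out.println(codePointLength("\r\n"));        // 2
        // ESC (U+001B) is a single code point and is not removed.
        System.out.println(codePointLength("\u001B"));      // 1
        // 'a', two spaces, a tab, 'b': whitespace runs are not collapsed.
        System.out.println(codePointLength("a  \tb"));      // 5
        // The soft hyphen U+00AD between "co" and "op" is counted.
        System.out.println(codePointLength("co\u00ADop"));  // 5
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT stays two code points.
        System.out.println(codePointLength("e\u0301"));     // 2
    }
}
```

In each case the count is the number of code points stored in the string, not the number of glyphs a reader would perceive.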


You clarified as follows:

By "human" I kind of meant "programmer" as I would imagine most programmers would see \r\n as two characters ...

It is more complicated than that. I am a programmer, and for me it depends on the context whether \r\n are meaningful or not. If I am reading a README file, my brain will treat differences in white space as having no semantic importance. But if I am writing a parser, my code would take whitespace into account ... depending on the language it is intended to parse.



Answer 2:

Just check the Javadoc of the codePoints() method in CharSequence:

"Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream." (https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--)

The code-point-related members of the String class also help to understand what a code point is, for example this constructor:

String(int[] codePoints, int offset, int count): "Allocates a new String that contains characters from a subarray of the Unicode code point array argument." (https://docs.oracle.com/javase/8/docs/api/java/lang/String.html)

A code point is an int representing a Unicode code point (https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode), so all characters are included, even those that are not human-readable.
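The Javadoc behaviour quoted above can be sketched as follows. The class name SurrogatePairDemo, the helper codePointsOf and the sample strings are assumptions made for illustration; the sketch only relies on String.codePoints() and the String(int[], int, int) constructor mentioned in this answer.

```java
class SurrogatePairDemo {

    // Collects the code points of s into an int array.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which is stored as the surrogate pair D83D DE00.
        String s = "a\uD83D\uDE00";
        int[] cps = codePointsOf(s);

        System.out.println(s.length());                    // 3 UTF-16 code units
        System.out.println(cps.length);                    // 2 code points
        System.out.println(Integer.toHexString(cps[1]));   // 1f600

        // An unpaired surrogate is passed through as its own value.
        int[] lone = codePointsOf("\uD83D");
        System.out.println(Integer.toHexString(lone[0]));  // d83d

        // The int[] constructor turns code points back into a String.
        String roundTrip = new String(cps, 0, cps.length);
        System.out.println(roundTrip.equals(s));           // true
    }
}
```

So the only thing codePoints() smooths over is the UTF-16 surrogate encoding; every other code unit maps to exactly one stream element.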



Answer 3:

In Java 8, calling codePoints() on a String object returns an IntStream of code points. Calling toArray() on that stream yields one int per code point, so the length of the array is the number of code points in the string.
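If only the count is needed, the array is not required. The sketch below compares the stream-based count with String.codePointCount, which has been available since Java 5; the class and method names (CodePointCountCompare, viaStream, viaCodePointCount) and the sample string are hypothetical.

```java
class CodePointCountCompare {

    // Count obtained by materialising the stream, as in the question.
    static int viaStream(String s) {
        return s.codePoints().toArray().length;
    }

    // Same count without building an array.
    static int viaCodePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'e', combining acute accent, U+1F600 (a surrogate pair), CR, LF.
        String s = "e\u0301\uD83D\uDE00\r\n";
        System.out.println(s.length());            // 6 UTF-16 code units
        System.out.println(viaStream(s));          // 5 code points
        System.out.println(viaCodePointCount(s));  // 5 code points
    }
}
```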