I know there is String#length and the various methods in Character which more or less work on code units/code points.
What is the suggested way in Java to actually return the result as specified by Unicode standards (UAX#29), taking things like language/locale, normalization and grapheme clusters into account?
The normal model of Java string length
String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.
Your description [1] of the semantics of length() based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.
Alternative models of string length
To get the number of Unicode code points in a String use str.codePointCount(0, str.length()) -- see the javadoc.
To get the size (in bytes) of a String in some other encoding use str.getBytes(charset).length.
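As a minimal sketch of the three models so far (the class and method names here are my own, purely for illustration), the same string gives three different "lengths":

```java
import java.nio.charset.StandardCharsets;

public class LengthModels {
    // Number of UTF-16 code units (char values); this is String.length().
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', U+00E9 (e with acute), U+1F600 (outside the BMP, so a surrogate pair)
        String s = "a\u00E9\uD83D\uDE00";
        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
        System.out.println(utf8Bytes(s));  // 7
    }
}
```

UTF-8 is only one choice of charset here; another encoding would give a different byte count.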
To deal with locale-specific issues, you can use Normalizer to normalize the String to whatever form is most appropriate to your use-case, and then use codePointCount as above.
But in some cases, even this won't work; e.g. the Hungarian letter counting rules which the Unicode standard apparently doesn't cater for.
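A rough sketch of the normalize-then-count approach, assuming NFC is the form you want (the helper name nfcCodePoints is hypothetical):

```java
import java.text.Normalizer;

public class NormalizedLength {
    // Normalize to NFC (composed form), then count code points.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by a combining acute accent
        System.out.println(decomposed.length());       // 2
        System.out.println(nfcCodePoints(decomposed)); // 1
    }
}
```

Note that this only helps where a precomposed character exists. A base letter plus a combining mark with no precomposed form still counts as more than one code point after normalization.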
Using String.length() is generally OK
The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this:
String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());
it really doesn't matter that "mum".length() is not returning code points or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.
Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.
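To illustrate the point about normalizing first, here is a small sketch (the class and the textAfter helper are made up for this example, and NFC is an assumed choice of form). Once the text and the search term are in the same normalization form, indexOf and length() line up again:

```java
import java.text.Normalizer;

public class NormalizedSearch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Returns the text following the first occurrence of needle, or null if it is absent.
    // Both arguments are normalized to NFC before searching.
    static String textAfter(String text, String needle) {
        String t = nfc(text);
        String n = nfc(needle);
        int pos = t.indexOf(n);
        return pos < 0 ? null : t.substring(pos + n.length());
    }

    public static void main(String[] args) {
        String text = "cafe\u0301 au lait"; // decomposed: 'e' + combining acute
        String needle = "caf\u00E9";        // precomposed: U+00E9

        System.out.println(text.indexOf(needle));              // -1, the forms differ
        System.out.println("[" + textAfter(text, needle) + "]"); // [ au lait]
    }
}
```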
[1] This description was on some versions of the question. See the edit history ... if you have sufficient rep points.
java.text.BreakIterator is able to iterate over text and can report on "character", word, sentence and line boundaries.
Consider this code:
import java.text.BreakIterator
import java.util.Locale

// Counts user-perceived characters by stepping over character boundaries.
def length(text: String, locale: Locale = Locale.ENGLISH): Int = {
  val charIterator = BreakIterator.getCharacterInstance(locale)
  charIterator.setText(text)
  var result = 0
  while (charIterator.next() != BreakIterator.DONE) result += 1
  result
}
Running it:
scala> val text = "Thîs lóo̰ks we̐ird!"
text: java.lang.String = Thîs lóo̰ks we̐ird!
scala> val graphemes = length(text)
graphemes: Int = 17
scala> val codepoints = text.codePointCount(0, text.length)
codepoints: Int = 21
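Since the question asks about Java, here is roughly the same function as a Java sketch (the class name GraphemeCount and method name count are my own). How closely BreakIterator follows UAX#29 depends on the JDK, so results for newer constructs such as emoji sequences may differ between versions; simple base-plus-combining-mark cases like the one below should be stable.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCount {
    // Counts "character" boundaries as reported by BreakIterator for the given locale.
    static int count(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "surpi\u0301se!"; // 'i' followed by a combining acute accent
        System.out.println(text.length());               // 9
        System.out.println(count(text, Locale.ENGLISH)); // 8
    }
}
```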
With surrogate pairs:
scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"
parens: java.lang.String =