According to the Java SE 7 Specification, Java uses the Unicode UTF-16 standard to represent characters.
When imagining a `String` as a simple array of 16-bit variables, each containing one character, life is simple. Unfortunately, there are code points for which 16 bits simply aren't enough (16 of Unicode's 17 planes, roughly 16/17ths of all code points, lie outside that range). For a `String` this poses no direct problem: to store one of these ~1,048,576 supplementary characters, an additional two bytes are used, i.e. two array positions in that `String`.
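For example, a minimal sketch (U+1F600 is just an arbitrary character outside the BMP):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 encodes it
        // as a surrogate pair: U+D83D followed by U+DE00.
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                      // 2 -> two array positions (code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 -> but only one code point
    }
}
```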
This works for `String`s, because there can always be an additional two bytes. But when it comes to single variables which, in contrast to the UTF-16 encoding, have a fixed length of 16 bits, how can these characters be stored? In particular, how does Java do it with its 2-byte `char` type?
The answer is in the javadoc of `Character`:

> The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. [...] A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points.

Simply said: a `char` does not store a character, it stores one UTF-16 code unit; a supplementary character consists of two such code units (a surrogate pair) and therefore does not fit in a single `char`.

Even simpler said: a `char` can hold any BMP character; for anything beyond that you need two `char`s, or an `int`.
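A small sketch of what that means in practice (again using U+1F600 as an arbitrary example):

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // Each half of the surrogate pair fits in a char...
        char high = '\uD83D';
        char low  = '\uDE00';

        // ...but the full character U+1F600 only fits in an int.
        int codePoint = Character.toCodePoint(high, low);

        System.out.println(Integer.toHexString(codePoint)); // 1f600
        System.out.println(Character.charCount(codePoint)); // 2: needs two chars in UTF-16
    }
}
```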
As an aside, it can be noted that Unicode's evolution past the BMP made UTF-16 largely obsolete, since it no longer even offers a fixed byte-to-character ratio. That's why more modern languages are based on UTF-8. The UTF-8 Everywhere manifesto helps understand this.
Basically, strings store a sequence of UTF-16 code units... which isn't the same as storing a sequence of Unicode code points.
When a character outside the Basic Multilingual Plane is required, that takes up two UTF-16 code units within the `String`.

Most `String` operations - `length()`, `charAt()`, `substring()` etc. - deal in numbers of UTF-16 code units. However, there are operations like `codePointAt()` which will deal with full Unicode code points... although the indexes are still expressed in terms of UTF-16 code units.
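For example (a quick sketch, again with U+1F600 as an arbitrary non-BMP character):

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // 'a', then U+1F600 as a surrogate pair, then 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                        // 4 code units, though only 3 code points
        System.out.println(Integer.toHexString(s.charAt(1)));  // d83d: just the high surrogate
        System.out.println(s.substring(1, 3));                 // the full pair, indexed by code units

        // codePointAt() returns a full code point, but its index is still a code-unit index
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
    }
}
```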
EDIT: If you want to store a non-BMP code point in a single `char`, you're basically out of luck. It's like wanting to store more than 256 distinct values in a `byte` variable... it just doesn't work. Following the conventions for representing a code point elsewhere (e.g. in `String`), it's best to just use an `int` variable.
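A short sketch of that convention (the class name and the choice of code point are made up for illustration; the methods are standard `java.lang` API):

```java
public class IntCodePointDemo {
    public static void main(String[] args) {
        // A full non-BMP code point fits comfortably in an int.
        int grinningFace = 0x1F600;

        // Converting to UTF-16 yields the two-char surrogate pair...
        char[] units = Character.toChars(grinningFace);
        System.out.println(units.length); // 2

        // ...and the code point can be appended to a String directly.
        String s = new StringBuilder().appendCodePoint(grinningFace).toString();
        System.out.println(s.codePointAt(0) == grinningFace); // true
    }
}
```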