Why do Unicode code points appear as U+
<codepoint>
?
For example, U+2202
represents the character ∂.
Why not U-
(dash or hyphen character) or anything else?
Why do Unicode code points appear as U+
<codepoint>
?
For example, U+2202
represents the character ∂.
Why not U-
(dash or hyphen character) or anything else?
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990's software industry discussions about Unicode, is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came in to use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols
U+HHHH
andU-HHHHHHHH
(p. 559).The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
u'xyz'
to indicate a Unicode string, a sequence of Unicode characters'\uxxxx'
to indicate a string with a unicode character denoted by four hex digits'\Uxxxxxxxx'
to indicate a string with a unicode character denoted by eight hex digitsIt is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (
0xB9
orB9h
). Why0xB9
and not0hB9
(or&hB9
or$B9
)? Just because that's how the coin flipped :-)