Spec justification for &#x80; to &#x9F; in UTF-8 d

The HTML 4.01 spec says this about numeric (including hexadecimal) character references:

Numeric character references specify the code position of a character in the document character set.

So if the document character set encoding is UTF-8, the numeric references should specify a Unicode code point.

The HTML5 spec says this about hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.

But it seems that all the modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. U+0080 itself is a C1 control character (PAD).

&#x20AC; also (correctly) displays €.

Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?

[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]

Tags: html utf-8 windows-1252 character-reference

2 answers

倾城　Initia

Reply #2 · 2019-06-24 03:18

As I have done here as well, I'll quote Wikipedia again:

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

So it seems to be a legacy issue.
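
To make the two readings concrete, here is a minimal Python sketch (my own illustration, not part of the quoted article). It decodes the bytes 0x80 and 0x9F with Python's `cp1252` codec, which is Python's name for Windows-1252, and then checks what U+0080 is when the number is taken literally as a Unicode code point. The variable names are arbitrary.

```python
import unicodedata

# Legacy reading: treat the number as a byte in Windows-1252 ("cp1252" in Python).
euro = bytes([0x80]).decode("cp1252")
y_diaeresis = bytes([0x9F]).decode("cp1252")

print(f"0x80 as Windows-1252 -> U+{ord(euro):04X} {euro}")
print(f"0x9F as Windows-1252 -> U+{ord(y_diaeresis):04X} {y_diaeresis}")

# Literal reading: treat the number as a Unicode code point.
# U+0080 falls in the C1 control block, so its general category is "Cc".
literal_category = unicodedata.category(chr(0x80))
print(f"U+0080 general category: {literal_category}")
```

The first reading gives the characters the question's title is about, which matches what the browsers display.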


迷人小祖宗

Reply #3 · 2019-06-24 03:34

I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, under "consume a character reference", which defines the mapping for these characters.
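
As a rough way to see that mapping in action outside a browser, Python's standard `html.unescape` is documented as following the HTML 5 rules for both valid and invalid character references. The sketch below is only an illustration under that assumption; it is not the spec's algorithm itself, and the `references` list is just sample input.

```python
import html

# Sample references: two in the 80-9F range, the decimal form of the first,
# and the reference that names the euro sign's real code point.
references = ["&#x80;", "&#x9F;", "&#128;", "&#x20AC;"]

# Resolve each reference and record the code point of the resulting character.
resolved = {ref: html.unescape(ref) for ref in references}

for ref, char in resolved.items():
    print(f"{ref} -> U+{ord(char):04X} {char}")
```

Both the hexadecimal and the decimal form of 0x80 come out as U+20AC, the same character that &#x20AC; names directly.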
