Unicode Replacement Characters in the PHP htmlspec

Posted 2019-08-28 02:15

In the htmlspecialchars function, if you set the ENT_SUBSTITUTE flag, it is supposed to replace some invalid characters.

Which characters are replaced? And what is the mapping between the invalid characters and the ones used to replace them?

Tags: php htmlspecialchars html-sanitizing

1 answer

beautiful°

#2 · 2019-08-28 02:56

There is only one, universal replacement character: U+FFFD. If you are writing out UTF-8, this code point is emitted directly, encoded as UTF-8. If not, you get the corresponding character reference &#xFFFD; instead.
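A minimal sketch of that behaviour, assuming a PHP 7+ runtime. The sample strings and variable names are made up for illustration. The Shift_JIS call is included only to show the non-UTF-8 branch, and its result is deliberately left unannotated.

```php
<?php
// "\xE9" is "é" in ISO-8859-1, but a lone 0xE9 byte is not valid UTF-8.
$input = "caf\xE9 <b>";

// No ENT_SUBSTITUTE or ENT_IGNORE: invalid input makes the whole result empty.
$strict = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE with UTF-8: the bad byte becomes U+FFFD (bytes EF BF BD).
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');                            // bool(true)
var_dump($substituted === "caf\u{FFFD} &lt;b&gt;");  // bool(true)

// Non-UTF-8 multibyte encoding: 0x81 is a Shift_JIS lead byte with no trail byte.
// Per the PHP manual, the substitute in this case is the character reference
// rather than the raw code point.
echo htmlspecialchars("\x81", ENT_QUOTES | ENT_SUBSTITUTE, 'Shift_JIS'), "\n";
```

The flags are passed explicitly because the default flags changed in PHP 8.1 to include ENT_SUBSTITUTE, so relying on defaults would behave differently across versions.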

There is no reversible mapping. By definition, the original byte sequence was invalid, i.e. it does not have a value (valid = has a value).
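To see why the substitution cannot be undone, here is a small sketch (the inputs are arbitrary examples, assuming UTF-8 as the declared encoding) in which two different invalid bytes produce identical output:

```php
<?php
$flags = ENT_QUOTES | ENT_SUBSTITUTE;

// Two different invalid bytes, each sitting between valid ASCII letters.
$a = htmlspecialchars("a\xE9b", $flags, 'UTF-8'); // lead byte with no continuation bytes
$b = htmlspecialchars("a\xFFb", $flags, 'UTF-8'); // 0xFF never occurs in UTF-8

// Both invalid bytes collapse to the same U+FFFD, so the original byte is lost.
var_dump($a === $b); // bool(true)
echo bin2hex($a), "\n";
```

Since distinct inputs map to the same output, nothing in the result tells you which byte was originally there.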

What gets replaced are bytes (not really "characters") that do not form a valid sequence in the assumed source encoding. For example, if the source encoding were UTF-16, a lone surrogate would be "invalid" (though a strict decoder may reject such input outright rather than substitute). As a simpler example, if the source encoding is ASCII, any byte value above 127 is invalid.
