Character encoding fail, why does \xBD display imp

Posted 2019-08-09 18:04

I'm just trying to understand character encoding a bit better, so I'm doing a few tests.

I have a PHP file that is saved as UTF-8 and looks like this:

<?php
declare(encoding='UTF-8');

header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>

<html>

<head>
    <meta charset="UTF-8" />
    <title>Test</title>
</head>

<body>
    <?php echo "\xBD"; # Does not work ?>
    <?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>

</html>

The page itself shows this:

[Screenshot not preserved: the first echo renders as a diamond question mark; the second renders as ½.]

The gist of the problem is that my web application has a bunch of character encoding problems: people copy and paste from Outlook or Word, and the characters get turned into the diamond question marks. (Do those have a real name?)

I'm trying to learn how to make sure all my input (basically $_GET, $_POST, and $_REQUEST) is converted to UTF-8 when the page loads, and that all output goes through proper UTF-8 handling.


My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?

2 Answers
疯言疯语
Reply #2 · 2019-08-09 18:50

0xBD is not valid UTF-8. If you want to encode "½" in UTF-8 then you need to use 0xC2 0xBD instead.

>>> print(b'\xc2\xbd'.decode('utf-8'))
½
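
To see why the lone byte is rejected, here is a minimal Python 3 sketch (the variable names are illustrative only). It encodes "½" both ways and then tries a strict UTF-8 decode of the single byte 0xBD:

```python
half = "\u00bd"  # U+00BD VULGAR FRACTION ONE HALF

utf8_bytes = half.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = half.encode("latin-1")  # one byte in Latin-1
print(utf8_bytes)    # b'\xc2\xbd'
print(latin1_bytes)  # b'\xbd'

# 0xBD is 0b10111101. The leading "10" bits mark a UTF-8 continuation byte,
# which can only follow a lead byte, so on its own it is malformed.
print(format(0xBD, "08b"))  # 10111101

try:
    b"\xbd".decode("utf-8")  # strict decoding is the default
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)  # False
```

So "\xBD" in the PHP source is the Latin-1 encoding of "½", sent to a browser that was told to expect UTF-8.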

If you want to use text from another charset (Latin-1 in this case), you need to transcode it to UTF-8 first, for example with iconv() or mb_convert_encoding().
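
As a rough sketch of that transcoding step, again in Python 3 to match the demo above: the helper name to_utf8 and the Latin-1 fallback are assumptions for illustration, not something from the question. The same idea should carry over to PHP's iconv or mb functions.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw as UTF-8 bytes.

    Hypothetical helper: if raw is already valid UTF-8 it is returned
    unchanged; otherwise it is assumed to be in `fallback` and transcoded.
    """
    try:
        raw.decode("utf-8")  # validation only; the result is discarded
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


print(to_utf8(b"\xbd"))      # b'\xc2\xbd'  (Latin-1 input, transcoded)
print(to_utf8(b"\xc2\xbd"))  # b'\xc2\xbd'  (already UTF-8, untouched)
```

This "validate, then fall back" check is only a heuristic. Some Latin-1 byte sequences also happen to be valid UTF-8, so knowing the real source encoding is more reliable than guessing.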

Also:

$ charinfo �
U+FFFD REPLACEMENT CHARACTER
别忘想泡老子
Reply #3 · 2019-08-09 19:04

\xBD on its own is not valid UTF-8; what you want is \xC2\xBD. The question mark is what applications substitute for byte sequences they cannot decode, so if you see it in your UTF-8 text, the text is either not UTF-8 or corrupted.
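
A small Python 3 illustration of that substitution, roughly what a lenient decoder such as a browser does (the sample bytes are made up for the example):

```python
import unicodedata

# Decode leniently: malformed bytes become a marker character instead of
# raising an error.
shown = b"1\xbd cups".decode("utf-8", errors="replace")
marker = shown[1]

print(unicodedata.name(marker))  # REPLACEMENT CHARACTER
print(f"U+{ord(marker):04X}")    # U+FFFD
```

Once the marker has been substituted, the original byte is gone, so the repair has to happen on the raw input before any lenient decoding.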
