Character encoding fail, why does \\xBD display im

2019-08-09 18:44发布

问题:

I'm just trying to understand character encoding a bit better, so I'm doing a few tests.

I have a PHP file that is saved as UTF-8 and looks like this:

<?php
declare(encoding='UTF-8');

header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>

<html>

<head>
    <meta charset="UTF-8" />
    <title>Test</title>
</head>

<body>
    <?php echo "\xBD"; # Does not work ?>
    <?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>

</html>

The page itself shows this:

The gist of the problem is that my web application has a bunch of character encoding problems, where people are copying and pasting from Outlook or Word and the characters get transformed into the diamond question marks (Do those have a real name?)

I'm trying to learn how to make sure all my input is transformed into UTF-8 when the page loads (Basically $_GET, $_POST, and $_REQUEST), and all output is done using proper UTF-8 handling methods.


My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?

回答1:

0xBD is not valid UTF-8. If you want to encode "½" in UTF-8 then you need to use 0xC2 0xBD instead.

>>> print '\xc2\xbd'.decode('utf-8')
½

If you want to use text from another charset (Latin-1 in this case) then you need to transcode it to UTF-8 first using the various iconv or mb functions.

Also:

$ charinfo �
U+FFFD REPLACEMENT CHARACTER


回答2:

\xBD is not valid as utf8 what you want is \xC2\xBD, the question mark thing is what applications replace invalid code points with, so if you see that in your utf8 text its either not utf8 or corrupted.