I'm just trying to understand character encoding a bit better, so I'm doing a few tests.
I have a PHP file that is saved as UTF-8 and looks like this:
<?php
declare(encoding='UTF-8');
header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<title>Test</title>
</head>
<body>
<?php echo "\xBD"; # Does not work ?>
<?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>
</html>
The page itself shows this:
The gist of the problem is that my web application has a bunch of character encoding problems, where people are copying and pasting from Outlook or Word and the characters get transformed into the diamond question marks (Do those have a real name?)
I'm trying to learn how to make sure all my input is transformed into UTF-8 when the page loads (Basically $_GET
, $_POST
, and $_REQUEST
), and all output is done using proper UTF-8 handling methods.
My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?
0xBD is not valid UTF-8. If you want to encode "½" in UTF-8 then you need to use 0xC2 0xBD instead.
If you want to use text from another charset (Latin-1 in this case) then you need to transcode it to UTF-8 first using the various iconv or mb functions.
Also:
\xBD
is not valid as utf8 what you want is\xC2\xBD
, the question mark thing is what applications replace invalid code points with, so if you see that in your utf8 text its either not utf8 or corrupted.