htmlentities() makes Chinese characters unusable

Posted 2020-03-23 17:36

Question:

We have a web application where we allow users to enter their own HTML in a text area. We save that data to our database.

When we load the HTML data back into the text area, we of course pass it through htmlentities() first. Otherwise a user could save </textarea> inside the text area, and our application would break when loading that content back into it.

This works great, except when users enter Chinese characters (and probably other languages such as Arabic or Japanese).

htmlentities() makes the Chinese text unusable, like this: �¨�³�¼�§ï

When I remove htmlentities() before loading the entered HTML into the text area, Chinese characters show up just fine, but then we have the problem of HTML interfering with our textarea, especially when a user enters </textarea> inside the text area.

I hope that makes sense.

Does anyone know how we can safely and correctly allow languages such as Chinese and Japanese inside our text area, while still being able to load arbitrary HTML into it safely?

Answer 1:

Have you tried using htmlspecialchars?

I currently use that in production and it's fine.

$foo = "我的名字叫萨沙";
echo '<textarea>' . htmlspecialchars($foo) . '</textarea>';

Alternately,

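// Decodes numeric entities back to UTF-8 text (here: 你好).
// Note: the 'HTML-ENTITIES' mbstring encoding is deprecated as of PHP 8.2.
$str = "&#20320;&#22909;";
echo mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');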

As found on http://www.techiecorner.com/129/php-how-to-convert-iso-character-htmlentities-to-utf-8/
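
To make the htmlspecialchars() suggestion concrete, here is a minimal sketch of the round trip for the textarea case. The stored value and the textarea name are made up for illustration, and passing ENT_QUOTES and 'UTF-8' explicitly is my assumption about what is wanted rather than something the question specifies.

```php
<?php
// Hypothetical stored value: user HTML with Chinese text and a stray closing tag.
$stored = '<p>我的名字叫萨沙</p></textarea><b>"hi" & bye</b>';

// Only &, <, >, " and ' are escaped; the multibyte characters pass through unchanged.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

echo '<textarea name="html">' . $escaped . '</textarea>', "\n";

// The browser decodes the entities when it fills the textarea, so the user
// sees (and re-submits) the original markup. This simulates that decoding step.
$roundTrip = htmlspecialchars_decode($escaped, ENT_QUOTES);
```

Because the escaped string no longer contains a literal </textarea>, the field cannot be closed early, and the Chinese text is never turned into entities in the first place.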



Answer 2:

Specify the charset explicitly, e.g. UTF-8, and it should work.

echo htmlentities($data, ENT_COMPAT, 'UTF-8'); 
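
A small sketch of why the charset argument matters, assuming the stored data is UTF-8 (the sample string is invented). Before PHP 5.4 the default charset for htmlentities() was ISO-8859-1, which is the likely cause of the garbage in the question: each byte of a multibyte character is treated as a separate Latin-1 character.

```php
<?php
$data = '你好 <b>';

// Wrong: the UTF-8 bytes are interpreted as ISO-8859-1, so every byte of
// each Chinese character is converted on its own and the text is destroyed.
$wrong = htmlentities($data, ENT_COMPAT, 'ISO-8859-1');

// Right: the string is decoded as UTF-8, so only characters that have a
// named entity (here < and >) are replaced.
$right = htmlentities($data, ENT_COMPAT, 'UTF-8');

echo $wrong, "\n";
echo $right, "\n"; // 你好 &lt;b&gt;
```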


Answer 3:

PHP is pretty appalling in terms of framework-wide support for international character sets (although it's slowly getting better, especially in PHP5, but you don't specify which version you're using). There are a few mb_ (multibyte, as in multibyte characters) functions to help you out, though.

This example may help you (from here):

<?php
/**
 *  Multibyte equivalent for htmlentities() [lite version :)]
 *
 * @param string $str
 * @param string $encoding
 * @return string
 **/
function mb_htmlentities($str, $encoding = 'utf-8') {
    mb_regex_encoding($encoding);
    // '&' must be replaced first so the entities produced below are not escaped twice.
    $pattern = array('&', '<', '>', '"', '\'');
    $replacement = array('&amp;', '&lt;', '&gt;', '&quot;', '&#39;');
    for ($i = 0; $i < count($pattern); $i++) {
        $str = mb_ereg_replace($pattern[$i], $replacement[$i], $str);
    }
    return $str;
}
?>

Also, make sure your page is specifying the same character set. You can do this with a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
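
The same charset can also be declared from PHP itself. This is a sketch under the assumption that the whole application uses UTF-8. The default_charset setting is used as the default encoding for htmlentities() and htmlspecialchars() only from PHP 5.6 on, so on older versions the explicit charset argument is still needed.

```php
<?php
// Assumption: every page in the application is served as UTF-8.
ini_set('default_charset', 'UTF-8');

// The HTTP header should agree with the meta tag. It has to be sent
// before any output, hence the guard.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
```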


Answer 4:

Most likely you're not using the correct encoding. If you already know your output encoding, use the charset argument of the htmlentities() function.

If you haven't settled on an internal encoding yet, take a look at the iconv functions; iconv_set_encoding("internal_encoding", "UTF-8"); might be a good start.
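
If some of the stored data might not be UTF-8 yet, one option is to normalize it before escaping. The following is only a sketch: to_utf8() and its $assumedSource parameter are hypothetical names, the ISO-8859-1 fallback is an assumption about the legacy data, and it needs the mbstring and iconv extensions.

```php
<?php
/**
 * Return $bytes as UTF-8. $assumedSource is a hypothetical parameter naming
 * the legacy encoding to assume when the input is not already valid UTF-8.
 */
function to_utf8(string $bytes, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    $converted = iconv($assumedSource, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new InvalidArgumentException('Input is not valid ' . $assumedSource);
    }
    return $converted;
}

$legacy = "caf\xE9"; // "café" encoded as ISO-8859-1
echo htmlentities(to_utf8($legacy), ENT_QUOTES, 'UTF-8'), "\n";
```

Text that is already valid UTF-8, such as Chinese input from a UTF-8 form, is returned unchanged, so the helper can be applied to mixed data.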