Question:
I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.
What I see:
ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åºã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ãã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å¦ã
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åºã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
Answer 1:
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
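To see where the garbage in the question comes from, here is a small sketch (assuming the mbstring extension is available) that reinterprets the UTF-8 bytes of a single character as ISO-8859-1, which is roughly what happens when no encoding is declared. The variable names are illustrative only.

```php
<?php
$original = 'に';  // U+306B, stored as the three bytes E3 81 AB in UTF-8

// Treat each byte as one ISO-8859-1 character and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n";            // e381ab
echo mb_strlen($garbled, 'UTF-8'), "\n";  // 3 (one character per original byte)

// The damage is reversible as long as no byte was lost along the way.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original);        // bool(true)
```

One Japanese character turns into three unrelated Latin-1 characters, which matches the pattern of the broken output shown in the question.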
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know whether the string already contains such a declaration, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
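A side note on the second snippet: as far as I know, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated since PHP 8.2. A minimal sketch of a comparable conversion, assuming the mbstring extension, uses mb_encode_numericentity instead. The helper name toNumericEntities is hypothetical and is not part of SmartDOMDocument.

```php
<?php
// Hypothetical helper: turn every non-ASCII character of a UTF-8 string into
// a numeric character reference, leaving ASCII (including the markup) as-is.
function toNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';

$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities($profile));

// Text nodes are exposed as UTF-8 no matter how the input was encoded.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the parser only ever sees ASCII here, the result does not depend on which encoding it would have guessed.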
Answer 2:
The problem is with saveHTML() and saveXML(): neither works correctly on Unix. They do not save UTF-8 characters correctly there, although they work on Windows.
The workaround is very simple.
If you use the default call, you will get the error you described:
$str = $dom->saveHTML(); // saves incorrectly
All you have to do is save as follows:
$str = $dom->saveHTML($dom->documentElement); // saves correctly
This line of code gets your UTF-8 characters saved correctly. Use the same workaround if you are using saveXML().
Note
English characters do not cause any problem when you use saveHTML() without parameters (because English characters are stored as single-byte characters in UTF-8).
The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).
I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.
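To look at serialization separately from parsing, here is a sketch that builds a tiny document in code and serializes it both ways. Whether the two outputs differ (raw UTF-8 versus character references) may depend on the PHP and libxml2 versions in use, so no expected output is shown. The sample text and variable names are assumptions for illustration.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');

// Build <p>…</p> directly, so no parser-side encoding guess is involved.
$p = $dom->createElement('p');
$p->appendChild($dom->createTextNode('イリノイ州シカゴにて、'));
$dom->appendChild($p);

$wholeDocument = $dom->saveHTML();    // no argument: dump the entire document
$singleNode    = $dom->saveHTML($p);  // with argument: dump only this node

echo $wholeDocument, "\n", $singleNode, "\n";
```

Comparing the two strings on your own system shows whether the parameter changes how non-ASCII characters are written.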
Answer 3:
Make sure the actual source file is saved as UTF-8 (you may even want to try saving it with the otherwise not recommended UTF-8 BOM to make sure).
Also, in the case of HTML, make sure you have declared the correct encoding using meta tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If it's a CMS (as you've tagged your question with Joomla), you may need to configure the appropriate settings for the encoding.
Answer 4:
You could prefix a line enforcing utf-8 encoding, like this:
@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
And you can then continue with the code you already have, like:
$doc->saveXML()
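The @ operator above hides every parser warning. A possible alternative, sketched here with a made-up $profile value, is to let libxml collect the messages so they can be inspected or logged; libxml_use_internal_errors, libxml_get_errors and libxml_clear_errors are standard PHP functions.

```php
<?php
// Made-up input with a non-standard tag so the parser may have something to report.
$profile = '<div><p>イリノイ州シカゴにて、</p><made-up-tag></made-up-tag></div>';

// Collect parser messages instead of printing warnings or silencing them with @.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

$errors = libxml_get_errors();          // array of LibXMLError objects (may be empty)
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
echo $doc->getElementsByTagName('p')->length, "\n";
```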
Answer 5:
You must feed the DOMDocument a version of your HTML with a header that makes sense.
Just like HTML5.
$profile = '<?xml version="1.0" encoding="' . $_encoding . '"?>' . $html;
It is probably a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying it :-) And stay away from htmlentities! That is an unnecessary back-and-forth that wastes resources.
Keep your code sane!
Answer 6:
This took me a while to figure out, but here's my answer.
Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way, but quick. After being convinced that DOM was just as quick, I first tried the following:
$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) === false) { // read the url
    // error message
}
else {
    // process
}
This failed spectacularly at preserving UTF-8 encoding, despite the proper meta tags, PHP settings and all the other remedies offered here and elsewhere. Here's what works:
$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) === false) {
    // error message
}
etc. Now everything's right with the world. Hope this helps.
Answer 7:
Works fine for me:
$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode($dom->saveHTML());
Answer 8:
The problem is that when you pass a parameter to DOMDocument::saveHTML(), you lose the encoding. In a few cases, you'll need to avoid the parameter and use plain old string functions to find what you are looking for.
I think the previous answer works for you, but since that workaround didn't work for me, I'm adding this answer to help people who may be in the same situation.
Answer 9:
Use this to get the correct result:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
This operation:
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
is a bad approach, because special symbols like &lt; and &gt; can be in $profile, and they will not be converted twice after mb_convert_encoding. That is a hole for XSS and incorrect HTML.
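One way to check how already-escaped markup in $profile behaves is to run it through the meta-prefixed load and compare the text node with the serialized output. This is only a sketch with a made-up $profile value, not a statement about every input.

```php
<?php
// Made-up input: escaped markup followed by non-ASCII text.
$profile = '<p>&lt;b&gt; イリノイ州</p>';

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);

$p = $dom->getElementsByTagName('p')->item(0);

// Inside the DOM the entities are decoded to a literal "<b>" in a text node...
$text = $p->textContent;

// ...and serializing the node escapes them again, so no <b> element appears.
$html = $dom->saveHTML($p);

echo $text, "\n", $html, "\n";
```

Running the same check with and without the entity conversion on your own data is a reasonable way to confirm whether escaping survives the round trip.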
Answer 10: