DOMDocument encoding problems / characters transfo

Posted 2019-01-24 12:04

Question:

I am using DOMDocument to modify HTML before it is output to the page. The input is only an HTML fragment, not a complete page. My initial problem was that all French characters got garbled, which I was able to correct after some trial and error. Now only one problem seems to remain: the ’ character (typographic apostrophe) gets transformed into ? .

The code :

<?php
    $dom = new DOMDocument('1.0','utf-8');
         $dom->loadHTML(utf8_decode($row->text));

         //Some pretty basic modification here, not even related to text

         //reinsert HTML, and make sure to remove DOCTYPE, html and body that get added auto.
         $row->text = utf8_encode(preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML())));
?>

I know it's getting messy with the utf8 decode/encode, but this is the only way I could make it work so far. Here is a sample string :

Input : Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement

Output : Sans doute parce qu?il vient d?atteindre une date déterminante dans son spectaculaire cheminement

If I find any more details, I'll add them. Thank you for your time and support!

Answer 1:

Don't use utf8_decode. If your text is in UTF-8, pass it as such.
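
To see why the decode/encode round trip loses only the apostrophe, here is a minimal sketch. It assumes the mbstring extension is loaded and uses mb_convert_encoding() as a stand-in for utf8_decode()/utf8_encode(), which are deprecated in recent PHP versions. The ’ character is U+2019, which has no slot in ISO-8859-1, whereas é does.

```php
<?php
// UTF-8 sample from the question; ’ is U+2019 (RIGHT SINGLE QUOTATION MARK).
$utf8 = 'qu’il vient d’atteindre une date déterminante';

// Stand-in for utf8_decode(): UTF-8 -> ISO-8859-1.
// Characters that Latin-1 cannot represent are replaced by the
// mbstring substitute character, which is "?" by default.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Stand-in for utf8_encode(): ISO-8859-1 -> UTF-8.
// This restores é, but the apostrophes are already gone.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, PHP_EOL;
// With default mbstring settings this prints:
// qu?il vient d?atteindre une date déterminante
```

The accented letters survive because they exist in both encodings; anything outside Latin-1 (curly quotes, the euro sign, most non-Western scripts) is lost at the first conversion.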

Unfortunately, DOMDocument defaults to Latin-1 (ISO-8859-1) when loading HTML. The behavior seems to be:

  • When fetching a remote document, it should deduce the encoding from the HTTP headers.
  • If no header was sent or the file is local, it looks for the corresponding meta http-equiv tag.
  • Otherwise, it defaults to Latin-1.

Example of it working:

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($s);

echo $d->textContent;

And with XML (default is UTF-8):

<?php
// Note the leading space in the second literal, so the two words don't run together
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante'.
    ' dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadXML($s);

echo $d->textContent;
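
When adding a meta tag to a fragment is inconvenient, another way to sidestep the Latin-1 default is to turn every non-ASCII character into a numeric entity before loading, so the parser only ever sees ASCII. This is not what the answer above does; it is a sketch of a different technique, assuming the mbstring extension is available. The variable names are illustrative.

```php
<?php
// Fragment from the question, UTF-8 encoded, with no charset declaration.
$fragment = '<p>Sans doute parce qu’il vient d’atteindre une date déterminante</p>';

// Map every code point from U+0080 to U+10FFFF to a numeric entity
// (format: start, end, offset, mask).
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($fragment, $map, 'UTF-8');
// $ascii now contains e.g. "qu&#8217;il" and "d&#233;terminante"

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($ascii);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Strings read back from the DOM are UTF-8 again.
echo $dom->documentElement->textContent, PHP_EOL;
```

Because the input is pure ASCII, the result does not depend on which encoding the parser guesses.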


Answer 2:

loadHTML() doesn't always recognize the encoding specified in the Content-type HTTP-EQUIV meta tag.

If the new DOMDocument('1.0', 'UTF-8') and loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $html) hacks don't work (they didn't for me on PHP 5.3.13), try this:

Add another <head> section immediately after the opening <html> tag, containing the correct Content-type HTTP-EQUIV meta tag. Then call loadHTML(), and afterwards remove the extra <head> tag.

// Ensure entire page is encoded in UTF-8
$encoding = mb_detect_encoding($body);
$body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;

// Insert a head and meta tag immediately after the opening <html> to force UTF-8 encoding
$insertPoint = false;
if (preg_match("/<html.*?>/is", $body, $matches, PREG_OFFSET_CAPTURE)) {
    // PREG_OFFSET_CAPTURE reports byte offsets, so use the byte-based
    // strlen()/substr() rather than their mb_* counterparts
    $insertPoint = $matches[0][1] + strlen($matches[0][0]);
}
if ($insertPoint !== false) {
    $body = substr($body, 0, $insertPoint)
        . "<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
        . substr($body, $insertPoint);
}
$dom = new DOMDocument();

// Suppress warnings for loading non-standard html pages
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_use_internal_errors(false);

// Now remove extra <head>
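
The snippet above stops at the removal step. One possible way to do it, sketched here under assumptions, is to find the injected meta element with XPath and drop it, together with its parent head if that ends up empty. The sample page, the variable names and the XPath expression are illustrative; the sketch also assumes the original page has no http-equiv meta of its own, so the first match is the injected one.

```php
<?php
// Hypothetical UTF-8 page without a charset declaration.
$body = '<html><head><title>Demo</title></head><body><p>qu’il</p></body></html>';
$injected = '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>';

// Inject the extra <head> right after the opening <html> tag.
$body = preg_replace('/<html[^>]*>/i', '$0' . $injected, $body, 1);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Locate the first http-equiv meta in document order (the injected one).
$xpath = new DOMXPath($dom);
$meta = $xpath->query('(//meta[@http-equiv])[1]')->item(0);
if ($meta !== null) {
    $parent = $meta->parentNode;
    $parent->removeChild($meta);
    // If the parser kept the injected <head> as its own element,
    // it is empty now and can be removed as well.
    if ($parent->nodeName === 'head' && !$parent->hasChildNodes()) {
        $parent->parentNode->removeChild($parent);
    }
}
```

Working on the DOM avoids guessing how the parser arranged the two head sections: the cleanup behaves the same whether they were kept apart or merged.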

See this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/



Answer 3:

This was enough for me; the other answers here were overkill. My HTML document already has a HEAD tag, and since HEAD tags don't have attributes, a plain replacement of <head> is sufficient. Leaving the extra META tag in the HTML was not a problem for my use case.

$data = str_ireplace('<head>', '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />', $data);
$document = new DOMDocument();
$document->loadHTML($data);


Answer 4:

As others have pointed out, DOMDocument::loadHTML() defaults to Latin-1 encoding with HTML fragments. It also wraps your HTML in something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>YOUR HTML</body></html>

As others have also pointed out, you can fix the encoding by inserting a HEAD element into your HTML with a META element that declares the correct encoding.

However, if you're working with an HTML fragment, you probably don't want the wrapping to happen and you also don't want to keep that HEAD element you inserted.

The following code inserts the HEAD element and then, after processing, uses a regex to remove all the wrapping elements:

<?php
$html = '<article class="grid-item"><p>Hello World</p></article><article class="grid-item"><p>Goodbye World</p></article>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($head . $html);
$xpath = new DOMXPath($dom);

// Loop through all article.grid-item elements and add the "invisible" class to them
$nodes = $xpath->query("//article[contains(concat(' ', normalize-space(@class), ' '), ' grid-item ')]");
foreach($nodes as $node) {
  $class = $node->getAttribute('class');
  $class .= ' invisible';
  $node->setAttribute('class', $class);
}

$content = preg_replace('/<\/?(!doctype|html|head|meta|body)[^>]*>/im', '', $dom->saveHTML());
libxml_use_internal_errors(false);

echo $content;
?>
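
As an alternative to the regex, the wrapper can be avoided by serializing only the children of body with saveHTML($node). This is a sketch of a different approach than the answer's, reusing its sample input; it assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument.

```php
<?php
$html = '<article class="grid-item"><p>Hello World</p></article><article class="grid-item"><p>Goodbye World</p></article>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($head . $html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Serialize each child of <body> on its own, so the doctype,
// <html>, <head> and <body> wrappers never reach the output.
$content = '';
$body = $dom->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
}

echo $content;
```

Unlike the regex, this cannot accidentally strip a meta or body string that appears inside the fragment's own content.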