This is my code:
$oDom = new DOMDocument();
$oDom->loadHTML("èàéìòù");
echo $oDom->saveHTML();
This is the output (the accented characters come back garbled):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èà éìòù</p></body></html>
I want this output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>èàéìòù</p></body></html>
I've tried with ...
$oDom = new DomDocument('4.0', 'UTF-8');
or with 1.0 and other variations, but nothing changed.
Another thing ...
Is there a way to get the same HTML back untouched?
For example, with <p>hello!</p> as input,
get the same <p>hello!</p> as output,
using DOMDocument only to parse the DOM and make some substitutions inside the tags.
Solution:
$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!
$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!
The saveHTML() method behaves differently when you pass it a node.
You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually.
The other important part is utf8_decode(), which converts the UTF-8 input to ISO-8859-1 before it is parsed.
In my case, none of the other attributes and methods of the DOMDocument class produced the desired result.
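Note that utf8_decode() is deprecated as of PHP 8.2. A minimal sketch of the same idea with mb_convert_encoding() follows. It assumes the mbstring extension is available and that libxml falls back to ISO-8859-1 when no charset is declared, which is what the garbled output in the question suggests. The helper name loadLatin1Html() is made up for illustration, and characters that do not exist in ISO-8859-1 will not survive the conversion.

```php
<?php
// Hypothetical helper, not part of the original answer: same idea as
// utf8_decode(), written with mb_convert_encoding() (needs ext-mbstring).
function loadLatin1Html(string $sString): DOMDocument
{
    $oDom = new DOMDocument();
    // Collect parser complaints internally instead of emitting warnings.
    $bPrevious = libxml_use_internal_errors(true);
    // Assumption: with no charset declared, libxml ends up reading these
    // bytes as ISO-8859-1, so hand it ISO-8859-1 bytes.
    $oDom->loadHTML(mb_convert_encoding($sString, 'ISO-8859-1', 'UTF-8'));
    libxml_clear_errors();
    libxml_use_internal_errors($bPrevious);
    return $oDom;
}

$oDom = loadLatin1Html('<p>èàéìòù</p>');
// The DOM API hands strings back as UTF-8.
$sText = $oDom->getElementsByTagName('p')->item(0)->textContent;
// Passing a node makes saveHTML() serialize only that subtree (no doctype).
$sHtml = $oDom->saveHTML($oDom->documentElement);
```

Checking textContent is a quick way to confirm the characters were decoded correctly before worrying about how saveHTML() serializes them.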
Try setting the encoding after you have loaded the HTML.
$dom = new DOMDocument();
$dom->loadHTML($data);
$dom->encoding = 'utf-8';
echo $dom->saveHTML();
Another way:
The issue appears to be known, according to the user comments on the manual page at php.net. Solutions suggested there include putting
<meta http-equiv="content-type" content="text/html; charset=utf-8">
in the document before you put any strings with non-ASCII chars in.
Another hack suggests putting
<?xml encoding="UTF-8">
as the first text in the document and then removing it at the end.
Nasty stuff. Smells like a bug to me.
This way:
/**
 * @param string $text
 * @return DOMDocument
 */
private function buildDocument($text)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_use_internal_errors(false);
    return $dom;
}
I don't know why the marked answer didn't work for my problem. But this one did.
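For anyone who wants to try this outside a class, here is a rough standalone sketch of the same meta-tag approach. The function name buildUtf8Document() and the sample string are my own, purely for illustration. It restores the previous libxml error setting instead of forcing it to false.

```php
<?php
// Hypothetical standalone variant of buildDocument(), for illustration only.
function buildUtf8Document(string $text): DOMDocument
{
    $dom = new DOMDocument();
    // Remember the current setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);
    // The leading meta tag declares the charset, so libxml decodes the
    // rest of the input as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = buildUtf8Document('<p>èàéìòù</p>');
// Read the paragraph text back; the DOM API returns UTF-8 strings.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Keep in mind that the meta element stays in the parsed document, so you may want to remove it or serialize only the nodes you care about.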
ref: https://www.php.net/manual/en/class.domdocument.php
<?php
// Bail out early on empty input, which loadHTML() rejects
if ( empty( $content ) ) {
    return false;
}
// Convert non-ASCII characters to HTML entities so the parser only sees ASCII
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');
// Create a new document
$doc = new DOMDocument('1.0', 'utf-8');
// Collect libxml parse errors internally instead of emitting warnings
libxml_use_internal_errors(true);
// Load the content without adding the enclosing html/body tags or a doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Do whatever you need with $doc from here
?>
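These two flags also cover the second part of the question (getting a fragment back untouched). A small sketch, assuming a PHP/libxml build recent enough to support both flags and an input with a single top-level element; the variable names are mine:

```php
<?php
$fragment = '<p>hello!</p>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// No implied html/body wrapper and no default doctype are added.
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// saveHTML() may append a trailing newline, hence the trim().
$roundTrip = trim($doc->saveHTML());
```

With several top-level elements, LIBXML_HTML_NOIMPLIED tends to treat the first one as the root, so wrapping the fragment in a single container element is the safer choice there.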
Looks like you just need to set substituteEntities when you create the DOMDocument object.