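$dom = new DOMDocument();
// Set parser options before loading; assigning them afterwards has no effect on the parse.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');
// Copy the live node list into an array first, because removing nodes
// while iterating the live list skips elements.
$spans = array();
foreach ($elements as $span) {
    $spans[] = $span;
}
foreach ($spans as $span) {
    $span->parentNode->removeChild($span);
}
return $dom->saveHTML();
//return $string;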
When I use this code to parse a string, it changes the encoding, and the symbols are not shown the same way as when return $string
is uncommented. Why is that, and how can I avoid the charset change?
Ile
Unfortunately, it seems that DOMDocument will automatically convert all non-ASCII characters to HTML entities unless it knows the encoding of the original document.
Apparently, one option is to add a <meta> tag with the content type/encoding to the original string, but this means that it will be present in the output as well. Removing it might not be so easy.
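A minimal sketch of the <meta> approach, assuming the input string is UTF-8. The function name stripSpansUtf8 and the sample string are made up for illustration and are not from the original post. The hint is left in the document, so it shows up in the output, as noted above.

```php
<?php
// Hypothetical helper (not from the original post); assumes $html is UTF-8.
function stripSpansUtf8(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Prepend a content-type hint so the parser knows the input is UTF-8.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($hint . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first; removing nodes while iterating it skips elements.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // The <meta> hint is still part of the serialised document.
    return $dom->saveHTML();
}

echo stripSpansUtf8('<p>Café <span>remove me</span>olé</p>');
```

Whether the characters come back raw or as named entities can depend on the libxml version, so it is worth checking the output on your own setup.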
Another option I can think of is manually decoding the HTML entities, using code like this:
$trans = array_flip(get_html_translation_table(HTML_ENTITIES));
unset($trans["&quot;"], $trans["&lt;"], $trans["&gt;"], $trans["&amp;"]);
echo strtr($dom->saveHTML(), $trans);
This is a seriously ugly solution, but I can't think of anything else, other than using a different HTML parser. :(
Try to set the encoding in the constructor or with DOMDocument->encoding:
$dom = new DOMDocument('1.0', '…');
// or
$dom = new DOMDocument();
$dom->encoding = '…';
I also noticed something strange today that I can't explain. The code from the top is wrapped in a function. In some cases, after the function has processed the string, the returned string is wrapped in <doctype...> <html><body>STRING</body></html>:
The data is loaded from a database. When the data from the database is passed directly to the function, the extra tags are not added, but when the data is first stored in a variable and the function is called somewhere further down, the extra tags are added.
One more strange thing:
I had a case where I called this function to process a string, and a few lines below I added a call to trim. The DOM function then returned an error, and when I deleted the trim call (which came AFTER the DOM function), the error disappeared. Is there any reasonable explanation?
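On the extra <html><body> wrapper: as far as I know, loadHTML() builds a complete document around whatever it is given, and saveHTML() without arguments serialises that whole document. A rough sketch of one way around it, assuming only the markup inside <body> is wanted. The helper name bodyInnerHtml and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: serialise only the children of <body>.
function bodyInnerHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() with a node argument (PHP 5.3.6+) serialises just that subtree.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <span>remove me</span>world</p>');
echo bodyInnerHtml($dom);
```

This does not explain the trim error, and it does nothing about the encoding issue; it only drops the doctype, <html> and <body> wrapper from the returned string.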