I have an XHTML document being passed to a PHP app via Greasemonkey AJAX. The PHP app uses UTF-8. If I output the POST content straight back to a textarea in the AJAX receiving div, everything is still properly encoded in UTF-8.
When I try to parse it using XPath:
$dom = new DOMDocument();
$dom->loadHTML($raw2);
$xpath = new DOMXPath($dom);
$query = '//td/text()';
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    var_dump($node->wholeText);
}
the dumped strings are not UTF-8. How do I force DOM/XPath to use UTF-8?
If it is a fully fledged, valid XHTML document, you shouldn't use loadHTML() but load()/loadXML().
Given the example XHTML document
the script
prints
i.e. the output strings are UTF-8 encoded.
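As a minimal sketch of the loadXML() approach, assuming a small made-up XHTML sample (not the document from the original answer), the strings returned by the XPath query stay UTF-8. The sample text and the `x` prefix are illustrative choices. Because XHTML elements live in the XHTML namespace, the query needs a registered prefix.

```php
<?php
// Hypothetical XHTML sample, used only for illustration.
$xhtml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>test</title></head>
  <body><table><tr><td>者不終朝而會</td></tr></table></body>
</html>
XML;

$dom = new DOMDocument();
// loadXML() honours the encoding given in the XML declaration.
$dom->loadXML($xhtml);

$xpath = new DOMXPath($dom);
// The prefix "x" is arbitrary; it only has to map to the XHTML namespace.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

$cells = [];
foreach ($xpath->query('//x:td/text()') as $node) {
    $cells[] = $node->wholeText;
}
var_dump($cells);
```

Note that loadXML() is strict: input that is not well-formed XML (for example, an undeclared entity such as `&nbsp;`) will fail to load, so this only fits documents that really are valid XHTML.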
I have not tried it, but the second parameter of DOMDocument::__construct seems to be related to the encoding; maybe that'll help you :-) Otherwise, there is an encoding property in DOMDocument, which is writable.
Since the DOMXPath is constructed with the DOMDocument as its parameter, maybe it'll work...
I struggled with a similar problem (unable to force XPath to use UTF-8 in combination with loadHTML); in the end this excellent article provided the solution: http://devzone.zend.com/article/8855
A bit late in the game, but perhaps it helps someone...
The problem might be in the output, and not in the dom/xpath object itself.
If you would output the nodeValue directly, you would get corrupted characters e.g.:
You have to create your DOM object with "utf-8" as the second parameter, new \DomDocument('1.0', 'utf-8'), but when you print a DOM node list/element value you still get broken characters:
echo $contentItem->item($index)->nodeValue
You have to wrap it in utf8_decode:
echo utf8_decode($contentItem->item($index)->nodeValue); // output: 者不終朝而會,愚者可浹旬而學
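To illustrate why decoding helps here, the sketch below simulates the corruption on a plain string, without involving DOM at all. It assumes the damage comes from UTF-8 bytes being read as ISO-8859-1 and then re-encoded as UTF-8 (a double encoding). It uses mb_convert_encoding() in place of utf8_decode(), since utf8_decode() is deprecated as of PHP 8.2; the variable names are illustrative.

```php
<?php
$original = '者不終朝而會';

// Simulate the corruption: treat each UTF-8 byte as an ISO-8859-1 character
// and encode it again as UTF-8, producing a double-encoded string.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo one encoding layer; this is the same conversion utf8_decode() performs.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($repaired === $original);
```

This only repairs strings that were double-encoded in exactly this way. Applied to text that is already correct UTF-8, the same conversion would destroy any character outside ISO-8859-1.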
I had the same problem and I couldn't use tidy on my webserver. I found this solution and it worked fine:
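The snippet this answer referred to is not preserved here. As a hedged sketch of one commonly used workaround (not necessarily the one the author meant), you can prefix the markup with a Content-Type meta tag so that the HTML parser behind loadHTML() reads the input as UTF-8. The sample markup and the `$hint` variable are assumptions made for illustration.

```php
<?php
// Hypothetical sample input; in the question, $raw2 would hold the POST body.
$raw2 = '<table><tr><td>者不終朝而會</td></tr></table>';

// Assumption: a charset hint placed before the markup makes the HTML parser
// treat the bytes as UTF-8 instead of falling back to a single-byte charset.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($hint . $raw2);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query('//td/text()') as $node) {
    $texts[] = $node->wholeText;
}
var_dump($texts);
```

If the document already declares a different charset further down, that declaration comes after the hint, so it is worth checking the result on real input before relying on this.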