What would cause DOMNode::nodeValue to be empty?

Posted 2020-07-22 04:52

Question:

I'm trying to parse a document with DOMDocument and I'm running into serious problems. I wrote a script that runs fine on PHP 5.2.9, extracting content with DOMNode::nodeValue. The same script fails to get any content on PHP 5.3.3, even though it navigates to the correct nodes.

Basically, the code used looks like this:

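$dom = new DOMDocument();
// Properties such as preserveWhiteSpace are read at load time, so set them first.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($data);   // $data holds the raw HTML string
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query($query);   // $query is the XPath expression
$value = $nodelist->item(0)->nodeValue;   // comes back empty on 5.3.3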

I've checked to make sure that item(0) is in fact a node - it's there and even of the right type, but nodeValue is empty.

The script works on some documents but not others (on 5.3.3) - on 5.2.9 it works on all documents, returning the proper nodeValue.

Answer 1:

I seem to have either missed something basic or hit a bug (whether in PHP or in libxml, I don't know). The issue is fixed by making sure the data passed to loadHTML is UTF-8 encoded. Mind you, it wasn't the whole document that was in the wrong encoding: the problem was a single character in the element that wasn't valid UTF-8. That one character then threw off the handling of everything else in the document.
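A minimal sketch of that fix, assuming the mbstring extension is available. The helper name toUtf8, the sample $data string, and the ISO-8859-1 fallback are all assumptions for illustration; the real source encoding of the offending bytes has to be known or guessed for your own documents.

```php
<?php
// Hypothetical helper (not from the original post): returns $data unchanged
// when it is already valid UTF-8, otherwise converts it from $fallback.
// The ISO-8859-1 default is an assumption; use the real source encoding.
function toUtf8($data, $fallback = 'ISO-8859-1')
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}

// Sample input: \xE9 is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
$data = "<html><body><p>caf\xE9</p></body></html>";

$dom = new DOMDocument();
$dom->loadHTML(toUtf8($data));   // every byte sequence is now valid UTF-8
```

Converting the whole string is a blunt instrument: if the document mixes encodings, a blanket conversion can mangle the parts that were already correct, so checking with mb_check_encoding first is the safer default.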

What gets me is that this basically meant all document content was thrown out - but the structure was in place working normally. No errors or anything to suggest the content was seen as invalid.
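One way to check whether libxml recorded anything during parsing is to collect its error list explicitly instead of relying on PHP warnings. This is only a sketch: the function name loadHtmlWithErrors is hypothetical, and whether an invalid byte sequence actually produces an entry may depend on the PHP and libxml versions, so an empty list does not prove the input was clean.

```php
<?php
// Hypothetical helper (not from the original post): loads HTML and returns
// the document together with any messages libxml recorded while parsing.
function loadHtmlWithErrors($html)
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Reset libxml state so later callers are unaffected.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($dom, $messages);
}

list($dom, $messages) = loadHtmlWithErrors('<html><body><p>ok</p></body></html>');
foreach ($messages as $message) {
    echo $message, "\n";
}
```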