PHP XPath with text() and SimpleXMLElement->xpath

Posted 2019-05-05 18:03

Question:

I'm trying to get all text nodes of /td/span.

I'm trying with xpath /td/span/text()

The problem is that instead of returning each text node as a separate element (there are two here, "193" and "120"), it returns "193120" twice.

When I try the exact same XPath in any online tool, it works fine, but in PHP I get completely different results.

Using SimpleXMLElement:

$xhtmlSnippet = '<td><span>193<span>10</span><span></span><div>66</div><span>195</span><span>.</span><span>34</span><span>242</span><span></span>120<span>64</span></span></td>';

$xml = new SimpleXMLElement($xhtmlSnippet);

$xresult = $xml->xpath('/td/span/text()');    

foreach($xresult as $xnode){
    echo "<br /><br />NodeValue: " . $xnode;
}

Gives me:

NodeValue: 193120

NodeValue: 193120

Here is an example of it working properly via an online tool (ALL of the other online tools give the expected output also):

Working example in online tester

EDIT:

Using DOMDocument + DOMXPath, it seems to work as expected:

    $dom = new DOMDocument;
    $dom->loadXML($xhtmlSnippet);

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('/td/span/text()') as $textNode) {
        echo "\n\nTextNode: " . $textNode->nodeValue;
    }

Gives:

TextNode: 193

TextNode: 120

Answer 1:

A SimpleXMLElement can only represent elements and attributes, either individually or a collection of siblings of the same type. The ->xpath() method returns an array of SimpleXMLElement objects, which allows them to be non-siblings, but does not allow for any other node type.

Consequently, the expression /td/span/text() matches the two text nodes, but returns them as objects representing their parent element, which in this case happens to be the same <span> element, giving you an array with two entries that both represent that element.

The remaining part of the puzzle is that when you cast a SimpleXML element to a string, it combines all of its direct child text and CDATA nodes into one string, so the 193 and 120 get stuck together.

Thus the output is 193120, twice.
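One way to see this for yourself is to inspect what ->xpath() actually hands back. The following is a small sketch using the snippet from the question; the variable names ($matches, $count, $names) are just for illustration.

```php
<?php
// Same snippet as in the question.
$xhtmlSnippet = '<td><span>193<span>10</span><span></span><div>66</div><span>195</span><span>.</span><span>34</span><span>242</span><span></span>120<span>64</span></span></td>';

$xml = new SimpleXMLElement($xhtmlSnippet);
$matches = $xml->xpath('/td/span/text()');

// The expression matches two text nodes, so two results come back...
$count = count($matches);

// ...but each result is a SimpleXMLElement for the parent element,
// so asking for its name gives the element name, not a text node.
$names = [];
foreach ($matches as $match) {
    $names[] = $match->getName();
}

echo $count, "\n";                // 2
echo implode(',', $names), "\n";  // span,span
echo (string) $matches[0], "\n";  // 193120
```

Both entries report the name span, and casting either of them to string concatenates the two text children of that element.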

(This is definitely unintuitive behaviour, although it's hard to know quite what SimpleXML should do in this situation; perhaps it would be better to produce an error if the XPath expression resolves to something other than elements or attributes).


Since the DOM API has objects for every kind of node that can possibly exist in XML, and PHP includes a full implementation of that API, the XPath expression will work as expected there. What's more, the SimpleXML and DOM objects are actually both wrappers around the same internal memory structures, so you can write operations combining the two using dom_import_simplexml() and simplexml_import_dom().
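As a rough illustration of that shared structure, here is a minimal sketch (the tiny document and variable names are made up for the example): a change made through the DOM object is immediately visible through the SimpleXML object, because both point at the same underlying node.

```php
<?php
// A cut-down document, just for demonstration.
$sxe = new SimpleXMLElement('<td><span>193</span></td>');

// Get a DOMElement view of the same <span> node.
$domSpan = dom_import_simplexml($sxe->span);

// Modify the node through the DOM API: append another text node.
$domSpan->appendChild($domSpan->ownerDocument->createTextNode('120'));

// The SimpleXML object sees the change without any re-parsing.
$afterDomEdit = (string) $sxe->span;
echo $afterDomEdit, "\n"; // 193120

// And the DOM node can be handed back to SimpleXML.
$backToSimple = simplexml_import_dom($domSpan);
echo $backToSimple->getName(), "\n"; // span
```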

As a slightly inelegant example, if you wanted to run an XPath expression in the context of an element you'd already traversed to with SimpleXML, you could do something like this:

$dom_node = dom_import_simplexml($simplexml_node);
$dom_xpath = new DOMXPath($dom_node->ownerDocument);
$dom_xpath_result = $dom_xpath->query('span/text()', $dom_node);

foreach($dom_xpath_result as $xnode){
    echo "<br /><br />NodeValue: " . $xnode->nodeValue;
}

Obviously, you could wrap this up into a function as desired. Also note that since your expression starts at the document root (leading /) the actual context is irrelevant, which is why I've used a slightly different expression above.
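For instance, a wrapper might look something like the sketch below. The function name xpath_text_values() is hypothetical (it is not part of PHP), and returning an array of strings is just one possible design; you could equally return the DOMNodeList itself.

```php
<?php
/**
 * Hypothetical helper: run an XPath expression relative to a SimpleXML
 * element via DOMXPath, and return the string value of each matched
 * node, so that text nodes are kept separate.
 */
function xpath_text_values(SimpleXMLElement $context, string $expression): array
{
    $domNode = dom_import_simplexml($context);
    $xpath = new DOMXPath($domNode->ownerDocument);
    $nodes = $xpath->query($expression, $domNode);

    $values = [];
    if ($nodes === false) {
        // query() returns false for an invalid expression.
        return $values;
    }
    foreach ($nodes as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Usage with the snippet from the question.
$xhtmlSnippet = '<td><span>193<span>10</span><span></span><div>66</div><span>195</span><span>.</span><span>34</span><span>242</span><span></span>120<span>64</span></span></td>';
$xml = new SimpleXMLElement($xhtmlSnippet);

// Traverse to the outer <span> with SimpleXML, then ask for its text nodes.
$values = xpath_text_values($xml->span[0], 'text()');
print_r($values);
```

Here the expression text() is relative to the outer span element, so it selects only that element's own text nodes, 193 and 120, as separate array entries.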