Why is XPath unclean constructed? Why is text() no

Posted 2020-02-26 12:59

Assume I have:

<A>
  <B>C</B>
  <D>E</D>
</A>

Then I can output the B-element (including tags) with:

//B

Which will return

<B>C</B>

But why is text() not needed in a predicate? The following 2 lines give the same output:

/A[B = 'C']/D
/A[B/text() = 'C']/D

If XPath were cleanly constructed, I would expect it to be (or to use some other kind of element structure):

/A[B = <B>C</B>]/D

and:

/A[B/text()='C']/D

Can someone give me a rationale why text() is needed for output, but it is not needed for predicates?

Tags: xml xpath
1 answer
我只想做你的唯一
Reply #2 · 2020-02-26 13:17

I think it's a reasonable and natural question. I would rather see people asking conceptual questions like this, to understand how XPath works, than settle for a shallow understanding of XPath and end up asking shallow questions about why their XPath expression didn't do what they expected in scraping data from a certain web page.

Let's clear up some terms first. By "output", I assume you mean the same as "return": the value that an XPath expression selects. (XPath per se has no direct output capability.) By "cleanly constructed" I'm going to assume you mean "simply and consistently designed."

The short answer is that XPath is consistent, but like most flexible and powerful tools, it's not simple.

Next, we might need to ask which version of XPath you're thinking of. There are large differences between versions 1, 2, and 3. I will focus on XPath 1.0 because it's the most well-known and widely implemented, and I don't know 2.0 or 3.0 as well.

The B means the same thing whether it's in a predicate or not. Both in //B and in /A[B = 'C'], it's a node test. It matches (selects) element nodes named B. XPath knows nothing about tags. It operates on an abstract tree document model. An XPath expression can select elements and other nodes, but never tags.

So I think your question then reduces to, why does /A[B = 'C']/D succeed in selecting the D element in the XML sample you provided, when B selects an element rather than just the text 'C'? To reduce it further, why does B = 'C' evaluate as true for element A, when B is an element and not merely a text node containing 'C'?

The answer is in how the XPath 1.0 specification defines comparisons such as =:

If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true [emphasis added].

In other words, the sub-expression B could select multiple element nodes here, if /A had multiple child elements named B. (In this case, there is only one such child element.) To evaluate the expression B = 'C', XPath looks at the string value of each node selected by B. According to the docs,

The string value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

In this case, the only text node descendant of the B element node is the text node whose string-value is 'C'. Therefore the string-value of B is 'C', and so the predicate [B = 'C'] is true for element /A.
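To make that evaluation concrete, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It is not how any XPath engine is actually implemented. The helper names string_value and nodeset_equals_string are made up for illustration, and itertext() is assumed to be a fair stand-in for the spec's string-value of an element.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
doc = ET.fromstring("<A><B>C</B><D>E</D></A>")

def string_value(element):
    # Hypothetical helper: concatenation of all descendant text, in document order.
    return "".join(element.itertext())

def nodeset_equals_string(nodes, text):
    # Hypothetical helper: true if at least one node's string-value equals text.
    return any(string_value(node) == text for node in nodes)

b_nodes = doc.findall("B")                       # the B children of A
predicate_holds = nodeset_equals_string(b_nodes, "C")

# /A[B = 'C']/D: D is only selected when the predicate holds for A.
selected = doc.findall("D") if predicate_holds else []
print(predicate_holds, [d.tag for d in selected])
```

The comparison never looks at the B element's name or structure, only at the string it boils down to.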

Why does XPath define the string value of an element node in this way? I'm guessing it's partly for convenience in the common case of an element with a single text node. It also pays off with free-form marked-up text, like

<p>HTML that <em>could</em> have <b>arbitrary <tt>nesting</tt></b></p>

whose markup you sometimes want to ignore for certain purposes. There it can be very handy to quickly retrieve the concatenation of all descendant text nodes.
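As a rough illustration of what that concatenation yields for the paragraph above, here is a short Python sketch with the standard library's ElementTree. The itertext() call is only an approximation of XPath's string-value, not an XPath implementation.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>HTML that <em>could</em> have <b>arbitrary <tt>nesting</tt></b></p>"
)

# Join every descendant text fragment in document order, ignoring the markup.
text = "".join(p.itertext())
print(text)  # HTML that could have arbitrary nesting
```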

The other part of your question was, why wouldn't you write

/A[B = <B>C</B>]/D

or

/A[B/text()='C']/D

The second one has the shortest answer: you can. It's a little less convenient and less powerful, but it is more explicit and precise. It won't always give the same results, because this version doesn't ask about the string-value of B; it asks whether any B has a text node child whose value is 'C', rather than whether the concatenation of all descendant text nodes of any B yields 'C'.
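Here is a small Python sketch of where the two forms part ways. Both B elements below are invented for illustration and are not from the question. ElementTree has no text node objects, so the hypothetical helper child_text_nodes approximates them from .text and the children's .tail.

```python
import xml.etree.ElementTree as ET

def string_value(element):
    # Hypothetical helper: all descendant text, in document order.
    return "".join(element.itertext())

def child_text_nodes(element):
    # Hypothetical helper: text sitting directly inside element,
    # i.e. its leading text plus the text that follows each child.
    texts = [element.text] if element.text else []
    texts += [child.tail for child in element if child.tail]
    return texts

nested = ET.fromstring("<B><i>C</i></B>")   # 'C' is a grandchild text node
split = ET.fromstring("<B>C<i/>D</B>")      # child text nodes 'C' and 'D'

for b in (nested, split):
    by_string_value = string_value(b) == "C"     # like B = 'C'
    by_text_child = "C" in child_text_nodes(b)   # like B/text() = 'C'
    print(by_string_value, by_text_child)
```

For the first B only the string-value test succeeds, and for the second only the text() test does.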

As for /A[B = <B>C</B>]/D, XPath (1.0 at least) wasn't designed with a syntax for creating new nodes, such as <B>C</B>. But even if it were, what would B = <B>C</B> mean? You obviously aren't asking for an identity comparison but a sort of structural equivalence. The XPath definers would have to create a semantics of comparison where a comparison between two node-sets, or between a node-set and a newly defined type such as "structural template", is true if and only if (for example) there is a node in the (first) node-set that recursively matches the structure of the structural template, or of a node in the second node-set. But instead they defined it as follows,

If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true.
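A minimal Python sketch of that rule, again with the standard library's ElementTree rather than a real XPath engine. The document and the helper names string_value and nodesets_equal are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

def string_value(element):
    # Hypothetical helper: all descendant text, in document order.
    return "".join(element.itertext())

def nodesets_equal(first, second):
    # True if some node in first and some node in second share a string-value.
    second_values = {string_value(node) for node in second}
    return any(string_value(node) in second_values for node in first)

# Invented sample: B and D are different elements, but one of each holds 'C'.
doc = ET.fromstring("<A><B>C</B><B>X</B><D>E</D><D>C</D></A>")

result = nodesets_equal(doc.findall("B"), doc.findall("D"))
print(result)  # True
```

Note that the element names play no part: a B and a D compare equal as soon as their string-values match, which is quite different from a structural comparison.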

Given that they can only choose one of the two definitions for comparison of node-sets, why did they choose the latter instead of the definition you expected? I'm not privy to the proceedings of the XPath committee, but I suspect it came down to the latter definition being more in line with the most common use cases they had analyzed, with consideration also given to performance and simplicity of implementation.

I agree that this definition is not the most obvious way to define = comparison. But I think the designers were right, that comparing whole node tree structures is not a very common use case, whereas the common use cases (such as the one you gave) are well-covered by the tools that XPath does provide. For example, it's very simple in XPath to ask whether there is an A element that is a child of the root node, that has a child B element, whose text value (ignoring all sub-markup for the moment) is 'C'.
