The following XPath is usually sufficient for matching all anchors whose text contains a certain string:
//a[contains(text(), 'SENIOR ASSOCIATES')]
Given a case like this though:
<a href="http://www.freshminds.net/job/senior-associate/"><strong>
SENIOR ASSOCIATES <br>
</strong></a>
The text is wrapped in a <strong>, and there is also a <br> before the anchor closes, so the above XPath returns nothing.
How can the XPath be adapted so that it allows for the <a> containing additional tags such as <strong>, <i>, <b>, <br>, etc., while still working in the standard case?
Don't use text().
//a[contains(., 'SENIOR ASSOCIATES')]
Contrary to what you might think, text() does not give you the text of an element.
It is a node test, i.e. an expression that selects a list of actual nodes (!), namely the text node children of an element.
Here:
<a href="http://www.freshminds.net/job/senior-associate/"><strong>
SENIOR ASSOCIATES <br>
</strong></a>
there are no text node children of <a>. All the text nodes are children of <strong>. So text() gives you zero nodes.
Here:
<a href="http://www.freshminds.net/job/senior-associate/"> <strong>
SENIOR ASSOCIATES <br>
</strong></a>
there is one text node child of <a>. It's "empty" in the sense that it holds only whitespace.
The expression . (a single dot), on the other hand, selects exactly one node: the context node, the <a> itself.
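The difference between the two samples can be checked outside of an XPath engine. The following is a minimal Python sketch that uses the standard library's xml.etree.ElementTree as a stand-in; text_node_children is a hypothetical helper written for this illustration, and <br> is written as <br/> only so that the XML parser accepts the snippet.

```python
import xml.etree.ElementTree as ET

def text_node_children(elem):
    """Return the direct text children of elem, roughly what text() selects."""
    # ElementTree stores text before the first child in .text and text after
    # each child in that child's .tail; None means "no text node here".
    chunks = [elem.text] + [child.tail for child in elem]
    return [c for c in chunks if c]

SAMPLE_1 = ('<a href="http://www.freshminds.net/job/senior-associate/"><strong>\n'
            'SENIOR ASSOCIATES <br/>\n</strong></a>')
SAMPLE_2 = ('<a href="http://www.freshminds.net/job/senior-associate/"> <strong>\n'
            'SENIOR ASSOCIATES <br/>\n</strong></a>')

first = text_node_children(ET.fromstring(SAMPLE_1))
second = text_node_children(ET.fromstring(SAMPLE_2))

print(first)   # []
print(second)  # [' ']
```

Neither list contains the string being searched for, which is why the text()-based predicate fails for both samples.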
Now, contains() expects strings as its arguments. If one argument is not a string, a conversion to string is done first.
Converting a node set (consisting of 1 or more nodes) to string is done by taking the string value of the first node in the set, in document order. For an element, that is the concatenation of all its text node descendants(*). Therefore using . (or its more explicit equivalent string(.)) gives you SENIOR ASSOCIATES surrounded by a bunch of whitespace, because there is a bunch of whitespace in your XML.
To get rid of that whitespace, use the normalize-space() function:
//a[contains(normalize-space(.), 'SENIOR ASSOCIATES')]
or, shorter, because "the current node" is the default for this function:
//a[contains(normalize-space(), 'SENIOR ASSOCIATES')]
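The effect of normalize-space() is easiest to see when the whitespace sits inside the phrase rather than around it. The sketch below emulates the predicate in plain Python; normalize_space and anchors_containing are hypothetical helpers, and the document with its three links is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

def normalize_space(s):
    """Trim the ends and collapse runs of space, tab, CR and LF into one space."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def anchors_containing(root, needle):
    """Emulate //a[contains(normalize-space(.), needle)] over an ElementTree."""
    return [a for a in root.iter("a")
            if needle in normalize_space("".join(a.itertext()))]

# Hypothetical document: a plain link, a nested link with extra inner
# whitespace, and an unrelated link.
DOC = """<div>
  <a href="/plain">SENIOR ASSOCIATES</a>
  <a href="/nested"><strong>
SENIOR   ASSOCIATES <br/>
</strong></a>
  <a href="/other">Junior analyst</a>
</div>"""

hits = [a.get("href")
        for a in anchors_containing(ET.fromstring(DOC), "SENIOR ASSOCIATES")]
print(hits)  # ['/plain', '/nested']
```

Without the normalization step the nested link would be missed here, because its two words are separated by three spaces instead of one.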
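(*) That's the reason why using //a[contains(.//text(), 'SENIOR ASSOCIATES')] would work in the first of the two samples above but not in the second one: only the first text node of the set is converted, and in the second sample that node is the whitespace before <strong>.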
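The first-node rule from the footnote can be imitated in a few lines of Python. This is a sketch under the same assumptions as before (ElementTree instead of an XPath engine, <br/> instead of <br>); first_descendant_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def first_descendant_text(elem):
    """Return the first descendant text node, or '' if there is none.

    This imitates converting the node set .//text() to a string,
    where only the first node in document order is used.
    """
    return next(elem.itertext(), "")

SAMPLE_1 = ('<a href="http://www.freshminds.net/job/senior-associate/"><strong>\n'
            'SENIOR ASSOCIATES <br/>\n</strong></a>')
SAMPLE_2 = ('<a href="http://www.freshminds.net/job/senior-associate/"> <strong>\n'
            'SENIOR ASSOCIATES <br/>\n</strong></a>')

match_1 = "SENIOR ASSOCIATES" in first_descendant_text(ET.fromstring(SAMPLE_1))
match_2 = "SENIOR ASSOCIATES" in first_descendant_text(ET.fromstring(SAMPLE_2))

print(match_1)  # True
print(match_2)  # False
```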