Getting the text of an <a> with XPath when it's ...

The following XPath is usually sufficient for matching all anchors whose text contains a certain string:

//a[contains(text(), 'SENIOR ASSOCIATES')]

Given a case like this though:

<a href="http://www.freshminds.net/job/senior-associate/"><strong>
                        SENIOR ASSOCIATES <br> 
                        </strong></a>

The text is wrapped in a <strong>, and there's also a <br> before the anchor closes, so the above XPath returns nothing.

How can the XPath be adapted so that it allows for the <a> containing additional tags such as <strong>, <br>, etc. while still working in the standard case?

Tags: html xml xpath xhtml

1 answer

兄弟一词,经得起流年.

#2 · 2019-07-23 19:21

Don't use text().

//a[contains(., 'SENIOR ASSOCIATES')]

Contrary to what you might think, text() does not give you the text of an element.

It is a node test, i.e. an expression that selects a list of actual nodes (!), namely the text node children of an element.

Here:

<a href="http://www.freshminds.net/job/senior-associate/"><strong>
                    SENIOR ASSOCIATES <br> 
                    </strong></a>

there are no text node children of a. All the text nodes are children of strong. So text() gives you zero nodes.

Here:

<a href="http://www.freshminds.net/job/senior-associate/"> <strong>
                    SENIOR ASSOCIATES <br> 
                    </strong></a>

there is one text node child of a. It's empty (as in "whitespace only").
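To see the difference between the two samples, here is a small sketch in Python. It assumes the third-party lxml library is installed, and it writes <br> as <br/> so that the snippet parses as XML; neither of those comes from the question itself.

```python
from lxml import etree  # assumption: lxml is installed (third-party library)

# Sample 1: <strong> follows <a ...> immediately, so <a> has no text node children.
# <br> is written as <br/> here so the fragment is well-formed XML.
nested = etree.fromstring(
    '<a href="http://www.freshminds.net/job/senior-associate/"><strong>\n'
    '    SENIOR ASSOCIATES <br/>\n'
    '    </strong></a>'
)

# Sample 2: a single space sits between <a ...> and <strong>.
spaced = etree.fromstring(
    '<a href="http://www.freshminds.net/job/senior-associate/"> <strong>\n'
    '    SENIOR ASSOCIATES <br/>\n'
    '    </strong></a>'
)

# text() relative to <a> selects only its direct text node children.
no_children = [str(t) for t in nested.xpath("text()")]
one_child = [str(t) for t in spaced.xpath("text()")]

print(no_children)  # []
print(one_child)    # [' ']
```

In neither sample does text() reach the words inside <strong>.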

. on the other hand selects only one node (the context node, the <a> itself).

Now, contains() expects strings as its arguments. If one argument is not a string, a conversion to string is done first.

Converting a node set (consisting of 1 or more nodes) to string is done by concatenating all text node descendants of the first node in the set^(*). Therefore using . (or its more explicit equivalent string(.)) gives you SENIOR ASSOCIATES surrounded by a bunch of whitespace, because there is a bunch of whitespace in your XML.
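The string conversion can be checked directly. This is a minimal sketch, again assuming lxml is available and using <br/> instead of <br> so the sample is well-formed XML:

```python
from lxml import etree  # assumption: lxml is installed (third-party library)

a = etree.fromstring(
    '<a href="http://www.freshminds.net/job/senior-associate/"><strong>\n'
    '    SENIOR ASSOCIATES <br/>\n'
    '    </strong></a>'
)

# string(.) concatenates every text node descendant of the context node.
value = str(a.xpath("string(.)"))
print(repr(value))  # '\n    SENIOR ASSOCIATES \n    '

# contains() applies the same conversion to "." before searching.
found = a.xpath("contains(., 'SENIOR ASSOCIATES')")
print(found)  # True
```

The surrounding whitespace comes straight from the indentation in the markup.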

To get rid of that whitespace, use the normalize-space() function:

//a[contains(normalize-space(.), 'SENIOR ASSOCIATES')]

or, shorter, because "the current node" is the default for this function:

//a[contains(normalize-space(), 'SENIOR ASSOCIATES')]
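As a rough check that this covers both the nested case and the standard case, the sketch below compares the original expression with the adapted one. The wrapping <div>, the three anchors and their href values are made up for illustration, and lxml is assumed to be installed.

```python
from lxml import etree  # assumption: lxml is installed (third-party library)

# Hypothetical document: a plain anchor, a nested anchor, and an unrelated one.
doc = etree.fromstring(
    "<div>"
    '<a href="/plain">SENIOR ASSOCIATES</a>'
    '<a href="/nested"><strong>\n    SENIOR ASSOCIATES <br/>\n    </strong></a>'
    '<a href="/other">JUNIOR ANALYSTS</a>'
    "</div>"
)

# Original expression: only sees text node children of <a>.
old = [str(h) for h in doc.xpath("//a[contains(text(), 'SENIOR ASSOCIATES')]/@href")]

# Adapted expression: uses the whitespace-normalized string value of <a>.
new = [str(h) for h in doc.xpath("//a[contains(normalize-space(), 'SENIOR ASSOCIATES')]/@href")]

# The normalized string value of the nested anchor on its own.
normalized = str(doc.xpath("normalize-space(//a[@href='/nested'])"))

print(old)         # ['/plain']
print(new)         # ['/plain', '/nested']
print(normalized)  # SENIOR ASSOCIATES
```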

^(*) That's the reason why using //a[contains(.//text(), 'SENIOR ASSOCIATES')] would work in the first of the two samples above but not in the second one.
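The footnote can be verified with the two samples from earlier. This sketch assumes lxml (which evaluates XPath 1.0) and again uses <br/> so the fragments parse as XML:

```python
from lxml import etree  # assumption: lxml is installed (third-party library)

nested = etree.fromstring(
    '<a href="http://www.freshminds.net/job/senior-associate/"><strong>\n'
    '    SENIOR ASSOCIATES <br/>\n'
    '    </strong></a>'
)
spaced = etree.fromstring(
    '<a href="http://www.freshminds.net/job/senior-associate/"> <strong>\n'
    '    SENIOR ASSOCIATES <br/>\n'
    '    </strong></a>'
)

expr = "//a[contains(.//text(), 'SENIOR ASSOCIATES')]"

# Only the FIRST text node of .//text() is converted to a string.
# nested: the first text node holds the words, so <a> matches.
# spaced: the first text node is the lone space, so nothing matches.
hits_nested = len(nested.xpath(expr))
hits_spaced = len(spaced.xpath(expr))

print(hits_nested)  # 1
print(hits_spaced)  # 0
```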
