XPath text with children

Given this HTML:

<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>

How can I use XPath to get the following result:

[
    'This is a link',
    'This is another link.'
]

What I've tried:

//ul/li/text()

But this gives me ['This is ', 'This is .'] (without the text in the a tags).

Also:

string(//ul/li)

But this gives me ['This is a link'] (so only the first element)

Also:

//ul/li/descendant-or-self::text()

But this gives me ['This is ', 'a link', 'This is ', 'another link', '.']

Any further ideas?

Tags: html xpath scrapy

2 Answers

不美不萌又怎样

#2 · 2020-04-30 02:17

@Tomalak is correct in saying that XPath generally cannot select that which is not there.

However, in this case, the results you want are the string values of li elements. As you've found,

string(//ul/li)

gets you close but only returns the first desired string.

This points to a shortcoming in XPath 1.0 that was addressed in XPath 2.0.

In XPath 1.0, you have to iterate over the nodeset selected by //ul/li outside of XPath -- in XSLT, Python, Java, etc.
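As a rough sketch of that outside-of-XPath iteration in Python: the loop below uses the standard library's xml.etree.ElementTree rather than Scrapy. That is an assumption made so the example runs without extra packages, and it only works here because the snippet happens to be well-formed XML. The names HTML and string_values are illustrative.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it is well-formed, so an XML parser accepts it.
HTML = '''<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>'''

root = ET.fromstring(HTML)  # root is the <ul> element

# Select the <li> nodes with a path expression, then compute each node's
# string value in Python by joining all of its descendant text.
string_values = [''.join(li.itertext()) for li in root.findall('./li')]

print(string_values)
```

The path expression only selects the nodes; the per-node string value is assembled by the host language, which is the pattern XPath 1.0 forces on you.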

In XPath 2.0, the last location step can be a function, so you can use

//ul/li/string()

to directly return

This is a link
This is another link.

as requested.

This is more educational than practical if you're stuck with Scrapy, which only supports XPath 1.0, but knowing

XPath 1.0 only passes the first of a nodeset to string(),
XPath 2.0 allows the last location step to be a function, and
there's a difference between text() nodes and string values

is generally helpful in reasoning about XPath text selections.
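The difference between text nodes and a string value can be seen in a small sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for an XPath engine (an assumption for illustration; itertext() is only an approximation of selecting descendant text() nodes, and the variable names are made up).

```python
import xml.etree.ElementTree as ET

# Second <li> from the question, parsed on its own.
li = ET.fromstring('<li>This is <a href="#">another link</a>.</li>')

# Roughly what selecting the descendant text() nodes gives you: separate pieces.
text_nodes = list(li.itertext())

# Roughly what string(.) gives you: the pieces concatenated in document order.
string_value = ''.join(text_nodes)

print(text_nodes)
print(string_value)
```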


smile是对你的礼貌

#3 · 2020-04-30 02:25

XPath generally cannot select what is not there. These things do not exist in your HTML:

[
    'This is a link',
    'This is another link.'
]

They might exist conceptually on the higher abstraction level that is the browser's rendering of the source code, but strictly speaking even there they are separate, for example in color and functionality.

On the DOM level there are only separate text nodes and that's all XPath can pick up for you.

Therefore you have three options.

1. Select the text() nodes and join their individual values in Python code.
2. Select the <li> elements and, for each of them, evaluate string(.) or normalize-space(.) with Scrapy. normalize-space() would deal with whitespace the way you would expect it.
3. Select the <li> elements and access their .text property, which internally finds all descendant text nodes and joins them for you.

Personally I would go for the last option, with //ul/li as my basic XPath expression, as this would result in a cleaner solution.

As @paul points out in the comments, Scrapy offers a nice fluent interface to do multiple processing steps in one line of code. The following code implements variant #2:

import scrapy

selector = scrapy.Selector(text='''<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>''')

# Select each <li> with CSS, then evaluate normalize-space() relative to it
selector.css('ul > li').xpath('normalize-space()').extract()
# --> ['This is a link', 'This is another link.']
