scrapy HtmlXPathSelector determine xpath by search

Posted 2019-08-20 05:46

I have a fragment of HTML like the one below:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to get the string "The Keyword: The text".

I know that I can get the XPath of the HTML above using Chrome's inspector or Firebug in Firefox, call hxs.select(xpath).extract(), and then strip the HTML tags to get the string. However, that approach is not generic enough, because the XPath is not consistent across different pages.

So I am considering the following approach. First, search for "The Keyword:" using:

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get this:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

My question is how to get the desired string, "The Keyword: The text", from this match. In other words, how do I determine the XPath? Once the XPath is known, I can of course extract the string.

I am open to solutions other than Scrapy's HtmlXPathSelector (e.g. lxml.html might have more features, but I am very new to it).

Thanks.

Tags: xpath lxml scrapy

1 answer

We Are One

#2 · 2019-08-20 06:29

If I understand your question correctly, the "following-sibling" axis is what you are looking for.

 //*[contains(text(), "The Keyword:")]/following-sibling::span/a/text()

See: XPath Axes
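Since the question mentions lxml.html as an option, here is a minimal sketch of how the sibling lookup above could be combined with the label text to build the full string. It assumes lxml is installed. The helper name labelled_text and the wrapping <ul> are made up for illustration and are not part of the original question or answer.

```python
from lxml import html

# Sample fragment from the question, wrapped in <ul> (an assumption)
# so that it parses as a single root element.
FRAGMENT = (
    '<ul><li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li></ul>'
)


def labelled_text(root, keyword):
    """Return '<label text> <sibling link text>' for the first match, or None.

    Hypothetical helper, written only for this illustration.
    """
    # $kw is an XPath variable; lxml binds it from the keyword argument,
    # which avoids quoting problems when the keyword contains quotes.
    labels = root.xpath('.//*[contains(text(), $kw)]', kw=keyword)
    if not labels:
        return None
    label = labels[0]
    # Same axis step as in the answer: text of <a> inside the <span>
    # that follows the matched element.
    values = label.xpath('following-sibling::span/a/text()')
    parts = [label.text_content().strip()] + [v.strip() for v in values]
    return ' '.join(parts)


root = html.fromstring(FRAGMENT)
print(labelled_text(root, 'The Keyword:'))
```

The match is found by its text rather than by a fixed path, so the lookup should keep working when the <li> sits somewhere else on the page, as long as the label is still followed by a sibling <span> containing the link.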
