Scrapy - Select specific link based on text

This should be easy but I'm stuck.

<div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text 2</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=3&amp;powerunit=2">Link Text 3</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=4&amp;powerunit=2">Link Text 4</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |   

<!-- Next page link --> 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div>

I'm trying to use Scrapy (BaseSpider) to select a link based on its link text, using:

nextPage = HtmlXPathSelector(response).select("//div[@class='paginationControl']/a/@href").re("(.+)*?Next")

For example, I want to select the next-page link based on the fact that its text is "Link Text Next". Any ideas?

Tags: python web-crawler scrapy

3 Answers

家丑人穷心不美

#2 · 2020-08-23 02:27

Your XPath selects the href attribute, not the text inside the a tag. Judging from your example, the href does not contain "Next", so a regular expression applied to it cannot find the link.
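To make that concrete, here is a minimal sketch that uses only the standard library rather than Scrapy (an assumption made so it runs without extra installs). It parses a trimmed copy of the question's markup and compares searching the href strings with checking the link text. The names HTML, href_hits and text_hits are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the pagination markup from the question.
HTML = """<div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div>"""

root = ET.fromstring(HTML)
links = root.findall("a")  # direct <a> children of the div

# What the original selector passes to .re(): the href strings only.
hrefs = [a.get("href") for a in links]
href_hits = [h for h in hrefs if re.search(r"Next", h)]

# The word "Next" lives in the link text, so filter on that instead.
text_hits = [a.get("href") for a in links if "Next" in (a.text or "")]

print(href_hits)  # no href contains "Next"
print(text_hits)  # the href of the link whose text contains "Next"
```

The regular expression itself is not the problem here; it is simply being run against strings that never contain the word it looks for.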


Viruses.

#3 · 2020-08-23 02:50

Use a[contains(text(),'Link Text Next')]:

nextPage = HtmlXPathSelector(response).select(
    "//div[@class='paginationControl']/a[contains(text(),'Link Text Next')]/@href")

Reference: Documentation on the XPath contains function

PS. Your link text has a trailing space after "Link Text Next". An exact match would have to include that space (and anything else that follows it):

text()="Link Text Next "

Using contains avoids that. I think it is a bit more general while still being specific enough.


Lonely孤独者°

#4 · 2020-08-23 02:52

You can use the following XPath expression:

//div[@class='paginationControl']/a[text()="Link Text Next"]/@href

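This selects the href attribute of the link whose text is exactly "Link Text Next". Because the comparison is exact, any extra characters in the link text (such as the trailing " >" in the question's markup) must be included in the string.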

See XPath string functions if you need more control.

