scrapy and xpath function 'matches' syntax

I'm running scrapy 0.20.2.

$ scrapy shell "http://newyork.craigslist.org/ata/"

I would like to build a list of all links to advertisement pages, excluding the index.html pages.

$ sel.xpath('//a[contains(@href,"html")]')
... 
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atq/4243973984.html">Wicke'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html" class'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html">Recla'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/ata/index100.html" class="butt'>]

I would like to use the XPath matches function to match links of the form given by the regex [0-9]+.html.

$ sel.xpath('//a[matches(@href,"[0-9]+.html")]')
...
ValueError: Invalid XPath: //a[matches(@href,"[0-9]+.html")]

What's wrong? Thank you.

Tags: regex xpath scrapy

2 answers

再贱就再见

#2 · 2020-03-05 02:50

matches is an XPath 2.0 function, and Scrapy only supports XPath 1.0, which has no built-in regular expression support. You'll have to extract all the links with a Scrapy selector and then do the regex filtering in Python rather than inside the XPath.
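A minimal sketch of that Python-level filtering, using only the standard re module. The helper name filter_ad_links and the exact pattern are assumptions for illustration: the pattern requires a slash before the digits, which is stricter than the regex in the question (that one would also match index100.html). The sample hrefs are copied from the shell output above; in the Scrapy shell they would presumably come from something like sel.xpath('//a/@href').extract().

```python
import re

# Assumed pattern: a slash, one or more digits, a literal dot, then "html"
# at the very end of the href. The leading slash keeps "index100.html" out.
AD_LINK = re.compile(r"/[0-9]+\.html$")


def filter_ad_links(hrefs):
    """Keep only hrefs whose last path component is digits followed by .html."""
    return [h for h in hrefs if AD_LINK.search(h)]


# Sample hrefs taken from the question's shell output.
hrefs = [
    "/mnh/atq/4243973984.html",
    "/mnh/atd/4257230057.html",
    "/ata/index100.html",
]

ad_links = filter_ad_links(hrefs)
print(ad_links)
```

Scrapy selector lists also have a .re() method that applies a regex to the extracted strings directly, which may be enough when only the matched text is needed.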


Fickle 薄情

#3 · 2020-03-05 03:01

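For this specific use case, there is an XPath 1.0 workaround using translate(...):

//a[
  substring-before(@href, '.html') != ''
  and translate(substring-before(@href, '.html'), '0123456789', '') = ''
  and substring-after(@href, '.html') = '']

The first check makes sure .html actually occurs in @href with something before it: substring-before(...) returns an empty string both when the marker is missing and when nothing precedes it, so without this check every link lacking .html would pass. The translate(...) call removes all digits from the name part before the .html extension, so comparing the result with '' confirms the name is digits only. The last check makes sure .html actually is the file extension.

Note that the test applies to the whole @href value. It matches an href such as 4243973984.html, but not one with a leading path such as /mnh/atq/4243973984.html, because the path characters survive the translate(...) call.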
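To see which hrefs a test of this kind accepts, the checks can be mirrored in plain Python. This is only a sketch, assuming the intent is "a non-empty run of digits, then .html, then nothing else". The helper names substring_before, substring_after and is_numeric_html are hypothetical; they imitate the XPath 1.0 string functions and are not part of Scrapy or lxml.

```python
DIGITS = "0123456789"


def substring_before(s, marker):
    # XPath 1.0 semantics: empty string when the marker is absent.
    head, sep, _ = s.partition(marker)
    return head if sep else ""


def substring_after(s, marker):
    # XPath 1.0 semantics: empty string when the marker is absent.
    _, sep, tail = s.partition(marker)
    return tail if sep else ""


def is_numeric_html(href):
    """True when href is only digits followed by the .html extension."""
    name = substring_before(href, ".html")
    return (
        name != ""  # .html is present and something precedes it
        and name.translate(str.maketrans("", "", DIGITS)) == ""  # digits only
        and substring_after(href, ".html") == ""  # nothing after .html
    )


for href in ["4243973984.html", ".html", "/mnh/atq/4243973984.html", "/about"]:
    print(href, is_numeric_html(href))
```

Only the first sample passes. The href with a leading path is rejected because the slashes and letters are still there after the digits are removed, so for the Craigslist links in the question the path would have to be stripped first or the filtering done in Python.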
