How to fix Scrapy rules when only one rule is followed

Posted 2019-08-27 20:36

This code is not working:

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'),unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a'),unique=True),callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'),unique=True)),
)

The first rule is getting responses, but the second rule is not working. I'm sure the second rule's XPath is correct (I've tried it in scrapy shell). I also tried adding a callback to the first rule, selecting the second rule's path ('//div[@id="ItemResultList"]/div/div/div/a') there, and issuing a Request manually, and that works correctly.
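For reference, here is roughly what that working test looked like (a minimal sketch from memory; the class name SouqSpider and the parse_category callback are just names I'm using here, and parse_item is stubbed out):

```python
from urlparse import urljoin  # Python 2, matching the old SgmlLinkExtractor-era Scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector


class SouqSpider(CrawlSpider):
    name = "souq_com"
    allowed_domains = ['uae.souq.com']
    start_urls = ["http://uae.souq.com/ae-en/shop-all-categories/c/"]

    rules = (
        # only the "categories" rule, now with a callback so the item XPath can be tested by hand
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'), unique=True),
             callback='parse_category'),
    )

    def parse_category(self, response):
        # extract the item links with the 2nd rule's XPath and issue the Requests manually
        sel = Selector(response)
        for href in sel.xpath('//div[@id="ItemResultList"]/div/div/div/a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse_item)

    def parse_item(self, response):
        # item parsing goes here
        pass
```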

I also tried a workaround: using a BaseSpider instead of a CrawlSpider, but it only issues the first request and never calls the callback. How should I fix this?

1 Answer
Viruses.
Answered 2019-08-27 21:25

The order of the rules is important. According to the Scrapy docs for CrawlSpider rules:

If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.

If I follow the first link on http://uae.souq.com/ae-en/shop-all-categories/c/, i.e. http://uae.souq.com/ae-en/antique/l/, the items you want to follow are within this structure:

<div id="body-column-main">
    <div id="box-ads-souq-1340" class="box-container ">...
    <div id="box-results" class="box-container box-container-none ">
        <div class="box box-style-none box-padding-none">
            <div class="bord_b_dash overhidden hidden-phone">
            <div class="item-all-controls-wrapper">
            <div id="ItemResultList">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                ...

So the links you target with the 2nd rule sit inside <div> elements that have "fl" in their class. That means they also match the first rule, which picks up all links under '//div[@id="body-column-main"]//div[contains(@class,"fl")]', and therefore they will NOT be parsed with parse_item.

Simple solution: try putting your 2nd rule before the "categories" rule (unique=True is the default for SgmlLinkExtractor, so it can be omitted):

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div')), callback='parse_item'),

    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'))),

    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)

Another option is to change your first rule for the category pages to a more restrictive XPath, one that does not exist on the individual category pages, such as '//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'.
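For example, the rules could then keep their original order (a sketch only; I haven't re-checked that this XPath is present on every category overview page, so treat it as an illustration of the idea):

```python
rules = (
    # categories: restricted to the refinement list, which should not exist
    # on the individual item-listing pages, so item links no longer match here
    Rule(SgmlLinkExtractor(restrict_xpaths=(
        '//div[@id="body-column-main"]//div[contains(@class,"fl")]'
        '//ul[@class="refinementBrowser-mainList"]'))),

    # items
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a')),
         callback='parse_item'),

    # pagination
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)
```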

You could also define a regex for the category pages and use the allow parameter in your Rules.
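For instance, since the category listing pages look like http://uae.souq.com/ae-en/antique/l/, something along these lines could work (the URL patterns here are assumptions about the site's URL scheme, not verified):

```python
rules = (
    # items first, so they are not swallowed by the broader category rule
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a')),
         callback='parse_item'),

    # category listing pages, matched by URL patterns such as /ae-en/antique/l/
    Rule(SgmlLinkExtractor(allow=(r'/ae-en/[^/]+/l/',))),

    # pagination
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+',))),
)
```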
