This code is not working:
name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]
rules = (
#categories
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'),unique=True)),
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a'),unique=True),callback='parse_item'),
Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'),unique=True)),
)
The first rule is getting responses, but the second rule is not working.
I'm sure the second rule's XPath is correct (I've tried it in scrapy shell). I also tried adding a callback to the first rule, selecting the second rule's path ('//div[@id="ItemResultList"]/div/div/div/a') and issuing a Request from it, and that works correctly.
I also tried a workaround: using a BaseSpider instead of a CrawlSpider, but then it only issues the first request and never calls the callback.
How should I fix that?
The order of the rules is important. According to the Scrapy docs for CrawlSpider rules:

If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
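For instance (a minimal sketch with made-up XPaths), a link matched by both of the following rules is handled only by the first, follow-only rule, so parse_item never runs for it:

rules = (
    # broad rule listed first: it wins for every link it matches
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="broad"]',))),
    # narrower rule with a callback: never reached for links the
    # broad rule already matched
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="narrow"]//a',)), callback='parse_item'),
)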
If I follow the first link in http://uae.souq.com/ae-en/shop-all-categories/c/, i.e. http://uae.souq.com/ae-en/antique/l/, the items you want to follow are within this structure:

<div id="body-column-main">
    <div id="box-ads-souq-1340" class="box-container ">...
    <div id="box-results" class="box-container box-container-none ">
        <div class="box box-style-none box-padding-none">
            <div class="bord_b_dash overhidden hidden-phone">
            <div class="item-all-controls-wrapper">
            <div id="ItemResultList">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                ...
So, the links you target with the 2nd Rule sit inside <div> elements that have "fl" in their class. They therefore also match the first rule, which picks up all links under '//div[@id="body-column-main"]//div[contains(@class,"fl")]', and so they will NOT be parsed with parse_item.
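You can verify the overlap in scrapy shell. A quick sketch, using the hxs shortcut the shell provided in SgmlLinkExtractor-era Scrapy:

# scrapy shell http://uae.souq.com/ae-en/antique/l/
broad = set(hxs.select('//div[@id="body-column-main"]'
                       '//div[contains(@class,"fl")]//a/@href').extract())
items = set(hxs.select('//div[@id="ItemResultList"]'
                       '/div/div/div/a/@href').extract())
items.issubset(broad)  # True: every item link also matches the broader XPath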
Simple solution: try putting your 2nd Rule before the "categories" Rule (unique=True is the default for SgmlLinkExtractor, so it can be dropped):
name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]
rules = (
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div')), callback='parse_item'),
#categories
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'))),
Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)
Another option is to change your first rule for category pages to a more restrictive XPath, one that does not exist on the individual category pages, such as '//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'.
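Applied to the spider above, the "categories" rule would then become something like this (assuming the refinementBrowser-mainList element only exists on the all-categories overview page):

# categories, restricted to the category-overview navigation list
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'))),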
You could also define a regex for the category pages and use the allow parameter in your Rules.
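For example, the category listing pages here appear to end in /l/ (as in http://uae.souq.com/ae-en/antique/l/), so a sketch of such a rule, assuming that URL scheme holds across the site, could be:

# categories matched by URL pattern instead of page structure
# (assumes category pages look like /ae-en/<category>/l/)
Rule(SgmlLinkExtractor(allow=(r'/ae-en/[^/]+/l/',))),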