I have created a spider which is supposed to crawl multiple websites and I need to define different rules for each URL in the start_url list.
start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar",
]
rules = [
    Rule(LinkExtractor(restrict_xpaths=("//" + xpathString + "/a")), callback="parse_object", follow=True),
]
The only thing that needs to change in the rule is the XPath string for restrict_xpaths. I've already come up with a function that can derive the XPath I want dynamically from any website. I figured I could just take the current URL the spider is about to scrape, pass it through that function, and then pass the resulting XPath to the rule.
Unfortunately, from what I've found while searching, this isn't possible, since Scrapy uses a scheduler and compiles all the start_urls and rules right from the start. Is there any workaround to achieve what I'm trying to do?
I assume you are using CrawlSpider. By default, CrawlSpider rules are applied to all pages (whatever the domain) your spider is crawling. If you are crawling multiple domains in your start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain. (I mean, it's not available out of the box.)
You can run your spider with one start URL at a time (and domain-specific rules, built dynamically at init time), and run multiple spiders in parallel, as in the sketch below.
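Here is a minimal sketch of that idea, assuming CrawlerProcess to schedule one crawl per start URL in the same process. get_xpath_for() is a hypothetical stand-in for your own XPath-generating function, and parse_object is just a placeholder callback:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def get_xpath_for(url):
    # Hypothetical placeholder for your function that derives an XPath from a URL.
    return "//div[@id='content']"

class SingleSiteSpider(CrawlSpider):
    name = "single_site"

    def __init__(self, start_url=None, *args, **kwargs):
        # Build start_urls and rules from the single URL passed in,
        # *before* calling the parent __init__, so that CrawlSpider
        # compiles the rules we just created.
        self.start_urls = [start_url]
        xpath = get_xpath_for(start_url)
        self.rules = (
            Rule(LinkExtractor(restrict_xpaths=xpath + "/a"),
                 callback="parse_object", follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_object(self, response):
        # Placeholder callback.
        yield {"url": response.url}

if __name__ == "__main__":
    # One crawl per start URL, all scheduled in the same process.
    process = CrawlerProcess()
    for url in ["http://URL1.com/foo", "http://URL2.com/bar"]:
        process.crawl(SingleSiteSpider, start_url=url)
    process.start()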
Another option is to subclass CrawlSpider and customize it for your needs: keep the rules as a dict using domains as keys, with the list of rules to apply for each domain as values. See the _compile_rules and _requests_to_follow methods.
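A rough sketch of that idea (not a drop-in implementation): the spider below keeps a rules_by_domain dict, flattens it into the regular rules tuple so CrawlSpider can compile it as usual, and then filters the requests produced by _requests_to_follow so each page only follows its own domain's rules. It relies on CrawlSpider storing the rule index in request.meta['rule'], which is an internal detail; rules_by_domain, parse_object and the XPaths are made-up examples:

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DomainRulesSpider(CrawlSpider):
    name = "domain_rules"
    start_urls = [
        "http://URL1.com/foo",
        "http://URL2.com/bar",
    ]

    # One list of rules per domain (example XPaths).
    rules_by_domain = {
        "URL1.com": [Rule(LinkExtractor(restrict_xpaths="//div[@id='nav']/a"),
                          callback="parse_object", follow=True)],
        "URL2.com": [Rule(LinkExtractor(restrict_xpaths="//ul[@class='menu']/a"),
                          callback="parse_object", follow=True)],
    }

    def __init__(self, *args, **kwargs):
        # Flatten the dict into the flat `rules` tuple CrawlSpider expects,
        # remembering which domain each rule belongs to, then let
        # CrawlSpider._compile_rules() do its usual work.
        self._rule_domains = []
        flat_rules = []
        for domain, domain_rules in self.rules_by_domain.items():
            for rule in domain_rules:
                self._rule_domains.append(domain)
                flat_rules.append(rule)
        self.rules = tuple(flat_rules)
        super().__init__(*args, **kwargs)

    def _requests_to_follow(self, response):
        # Keep only the requests produced by rules whose domain matches
        # the domain of the page currently being crawled.
        current_domain = urlparse(response.url).netloc
        for request in super()._requests_to_follow(response):
            rule_index = request.meta.get("rule")
            if rule_index is None or self._rule_domains[rule_index] == current_domain:
                yield request

    def parse_object(self, response):
        # Placeholder callback.
        yield {"url": response.url}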
You can just override the parse method. That method receives a Scrapy Response object with the full HTML content, and you can run XPath expressions on it. You can also retrieve the URL from the Response object and, depending on the URL, run a custom XPath. Please check out the docs here: http://doc.scrapy.org/en/latest/topics/request-response.html
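A minimal sketch of that approach, assuming a plain scrapy.Spider and a hard-coded per-domain XPath table standing in for your XPath-generating function (the table, the XPaths and parse_object are all hypothetical):

from urllib.parse import urlparse

import scrapy

class PerUrlXPathSpider(scrapy.Spider):
    name = "per_url_xpath"
    start_urls = [
        "http://URL1.com/foo",
        "http://URL2.com/bar",
    ]

    # Example per-domain XPaths selecting the links to follow.
    xpaths = {
        "URL1.com": "//div[@id='nav']/a/@href",
        "URL2.com": "//ul[@class='menu']/a/@href",
    }

    def parse(self, response):
        # Pick an XPath based on the domain of the page that was just downloaded.
        domain = urlparse(response.url).netloc
        xpath = self.xpaths.get(domain)
        if not xpath:
            return
        # Follow the links selected by that XPath and scrape them.
        for href in response.xpath(xpath).getall():
            yield response.follow(href, callback=self.parse_object)

    def parse_object(self, response):
        # Placeholder callback.
        yield {"url": response.url}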