Dynamically adding domains to scrapy crawlspider d

I am currently using Scrapy's CrawlSpider to look for specific information across a list of start_urls. Once I've found the information I'm looking for on a given start_url's domain, I would like to stop scraping that domain, so the spider stops hitting it and only continues with the other start_urls.

Is there a way to do this? I have tried appending to deny_domains like so:

deniedDomains = []
...
rules = [Rule(SgmlLinkExtractor(..., deny_domains=(etc), ...))]
...
def parseURL(self, response):
    ...
    self.deniedDomains.append(specificDomain)

Appending doesn't seem to stop the crawling, but if I start the spider with the intended specificDomain then it'll stop as requested. So I'm assuming that you can't change the deny_domains list after the spider's started?

Tags: python scrapy

2 Answers

趁早两清

Reply #2 · 2019-05-30 03:46

Something like this?

# Current Scrapy import paths; the scrapy.contrib modules have been removed.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "foo"
    allowed_domains = ["example.org"]
    start_urls = ["http://www.example.org/foo/"]

    rules = (
        Rule(
            LinkExtractor(
                allow=('/foo/[^/+]',),
                # links pointing to these domains are never extracted
                deny_domains=('example.com',),
            ),
            callback='parseURL',
        ),
    )

    def parseURL(self, response):
        # here the rest of your code
        pass


Viruses.

Reply #3 · 2019-05-30 03:51

The best way to do this is to maintain your own dynamic_deny_domain list in your Spider class and filter requests against it.

Write a simple downloader middleware. It's a small class that implements one method, process_request(request, spider), which raises IgnoreRequest if the request's domain is in your spider.dynamic_deny_domain list and returns None otherwise.

Then add the middleware to DOWNLOADER_MIDDLEWARES in your Scrapy settings with a low order number, so that it runs before the other downloader middlewares: 'myproject.downloadermiddleware.IgnoreDomainMiddleware': 50,

Should do the trick.
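A minimal sketch of such a middleware, in case it helps. The is_denied helper, the subdomain matching, and the use of a set for dynamic_deny_domain are my own assumptions, not something Scrapy prescribes; the try/except around the import is only there so the snippet can be tried without Scrapy installed.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Fallback so the sketch can be exercised without Scrapy installed.
    class IgnoreRequest(Exception):
        pass


def is_denied(url, denied_domains):
    """Return True if the URL's host is a denied domain or a subdomain of one.

    Hypothetical helper; assumes denied_domains holds lowercase names.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in denied_domains)


class IgnoreDomainMiddleware:
    """Drops requests to domains the spider has marked as finished."""

    def process_request(self, request, spider):
        # dynamic_deny_domain is an attribute you maintain on your own spider.
        denied = getattr(spider, "dynamic_deny_domain", ())
        if is_denied(request.url, denied):
            raise IgnoreRequest("domain denied: %s" % request.url)
        return None  # let Scrapy keep processing the request
```

In the spider, you would initialise dynamic_deny_domain = set() on the class and call self.dynamic_deny_domain.add(specificDomain) in parseURL once the information has been found. Requests to that domain that are already queued are then dropped when they reach the downloader.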
