I have set up a CrawlSpider that aggregates all outbound links (crawling from the `start_urls` only to a certain depth, e.g. via `DEPTH_LIMIT = 2`):
```python
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import LinkNetworkItem  # adjust to your project's items module


class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["exampleA.com"]
    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()

        outgoing_links = []
        for link in links:
            if "http://" in link:
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # count the allowed domains that do NOT match this hostname
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)

        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None
```
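(For completeness: the depth cap mentioned above is just an ordinary project setting, e.g. in settings.py.)

```python
# settings.py -- limit how deep the spider follows links from the start URLs
DEPTH_LIMIT = 2
```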
I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add my list of domains to `start_urls` as well as `allowed_domains`, but in my opinion this causes the following problems:
- Will settings such as `DEPTH_LIMIT` be applied per `start_url`/`allowed_domain`?
- More importantly: if the sites are connected, will the spider jump from exampleA.com to exampleB.com because both are in `allowed_domains`? I need to avoid this criss-crossing, as I later want to count the outbound links for each site to gain information about the relationships between the websites!
So how can I scale this to more websites without running into the criss-crossing problem, while still applying the settings per website?
An additional image showing what I would like to achieve:
I have now achieved it without rules. I attach a `meta` attribute to every `start_url` and then simply check myself whether the links belong to the original domain, sending out new requests accordingly. To do this, override `start_requests`.
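Roughly, the override looks like this (just a sketch: the spider becomes a plain BaseSpider since the rules are gone, and only the `'domain'` meta key is essential — the other names are illustrative):

```python
from urlparse import urlparse

from scrapy.http import Request
from scrapy.spider import BaseSpider


class LinkNetworkSpider(BaseSpider):
    name = "network"
    start_urls = ["http://www.exampleA.com",
                  "http://www.exampleB.com",
                  "http://www.exampleC.com"]

    def start_requests(self):
        for url in self.start_urls:
            # tag every start request (and thus everything crawled from it)
            # with the second-level domain it belongs to
            domain = '.'.join(urlparse(url).hostname.split('.')[-2:])
            yield Request(url, meta={'domain': domain}, callback=self.parse_item)
```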
In the subsequent parsing method we grab the `meta` attribute (`domain = response.request.meta['domain']`), compare the domain with the extracted links, and send out new requests ourselves.

You would probably need to keep a data structure (e.g. a hash map) of URLs that the crawler has already visited. Then it is just a matter of adding URLs to the map as you visit them and not visiting URLs that are already in it (as this means you have already visited them). There are probably more complicated ways of doing this which would give you better performance, but these would also be harder to implement.
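To show how both ideas fit together, here is a rough sketch of the corresponding callback — the meta-based domain check plus a plain set as the "already visited" structure. The item fields are the ones from the question; everything else is illustrative:

```python
# imports as in the sketches above (urlparse, Request, HtmlXPathSelector, LinkNetworkItem)

class LinkNetworkSpider(BaseSpider):
    # name, start_urls and start_requests as in the sketch above

    visited = set()  # URLs that have already been scheduled for crawling

    def parse_item(self, response):
        domain = response.request.meta['domain']
        hxs = HtmlXPathSelector(response)

        item = LinkNetworkItem()
        item['internal_site'] = response.url
        item['out_links'] = []

        for link in hxs.select('//a/@href').extract():
            if "http://" not in link:
                continue
            link_domain = '.'.join(urlparse(link).hostname.split('.')[-2:])

            if link_domain != domain:
                # outbound link: record it for the item, but never follow it,
                # so the spider cannot criss-cross to the other sites
                item['out_links'].append(link)
            elif link not in self.visited:
                # internal link: follow it, keep the domain tag, and remember
                # it so the same URL is not requested twice
                self.visited.add(link)
                yield Request(link, meta={'domain': domain}, callback=self.parse_item)

        yield item
```

Note that Scrapy's default duplicate filter already drops requests for URLs it has seen before, so the explicit `visited` set is mostly there to make the intent obvious.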
, compare the domain with the extracted links and sent out new requests ourselves.You would probably need to keep a data structure (ex a hashmap) of URLs that the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them and not visiting URLs if they're in the hashmap already (as this means you have already visited them). There are probably more complicated ways of doing this which would give you greater performace, but these would also be harder to implement.