I have set up a CrawlSpider that aggregates all outbound links (crawling from the `start_urls` only to a certain depth, e.g. via `DEPTH_LIMIT = 2`):
```python
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import LinkNetworkItem  # adjust to your project's items module


class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["exampleA.com"]
    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()

        outgoing_links = []
        for link in links:
            if "http://" in link:
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # count the allowed domains that do NOT match this hostname
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:
                    outgoing_links.append(link)

        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None
```
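(For completeness: the depth cap mentioned above is just an ordinary project setting, e.g. in settings.py.)

```python
# settings.py -- limit how deep the spider follows links from the start URLs
DEPTH_LIMIT = 2
```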
I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add my list of domains to `start_urls` as well as `allowed_domains`, but in my opinion this causes the following problems:
- Will settings such as `DEPTH_LIMIT` be applied per `start_url`/`allowed_domain`?
- More importantly: if the sites are connected, will the spider jump from exampleA.com to exampleB.com because both are in `allowed_domains`? I need to avoid this criss-crossing, as I later want to count the outbound links for each site to gain information about the relationships between the websites!
So how can I scale this to more websites without running into the criss-crossing problem, while still applying the settings per website?
An additional image showing what I would like to achieve:
I have now achieved it without rules. I attach a `meta` attribute to every `start_url` and then simply check myself whether the links belong to the original domain, sending out new requests accordingly. To do this, override `start_requests`.
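Roughly, the override looks like this (just a sketch: the spider becomes a plain BaseSpider since the rules are gone, and only the `'domain'` meta key is essential — the other names are illustrative):

```python
from urlparse import urlparse

from scrapy.http import Request
from scrapy.spider import BaseSpider


class LinkNetworkSpider(BaseSpider):
    name = "network"
    start_urls = ["http://www.exampleA.com",
                  "http://www.exampleB.com",
                  "http://www.exampleC.com"]

    def start_requests(self):
        for url in self.start_urls:
            # tag every start request (and thus everything crawled from it)
            # with the second-level domain it belongs to
            domain = '.'.join(urlparse(url).hostname.split('.')[-2:])
            yield Request(url, meta={'domain': domain}, callback=self.parse_item)
```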
In the subsequent parsing method we grab the `meta` attribute (`domain = response.request.meta['domain']`), compare the domain with the extracted links, and send out new requests ourselves.

You would probably need to keep a data structure (e.g. a hash map) of URLs that the crawler has already visited. Then it is just a matter of adding URLs to the map as you visit them and not visiting URLs that are already in it (as this means you have already visited them). There are probably more complicated ways of doing this which would give you better performance, but these would also be harder to implement.
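To show how both ideas fit together, here is a rough sketch of the corresponding callback — the meta-based domain check plus a plain set as the "already visited" structure. The item fields are the ones from the question; everything else is illustrative:

```python
# imports as in the sketches above (urlparse, Request, HtmlXPathSelector, LinkNetworkItem)

class LinkNetworkSpider(BaseSpider):
    # name, start_urls and start_requests as in the sketch above

    visited = set()  # URLs that have already been scheduled for crawling

    def parse_item(self, response):
        domain = response.request.meta['domain']
        hxs = HtmlXPathSelector(response)

        item = LinkNetworkItem()
        item['internal_site'] = response.url
        item['out_links'] = []

        for link in hxs.select('//a/@href').extract():
            if "http://" not in link:
                continue
            link_domain = '.'.join(urlparse(link).hostname.split('.')[-2:])

            if link_domain != domain:
                # outbound link: record it for the item, but never follow it,
                # so the spider cannot criss-cross to the other sites
                item['out_links'].append(link)
            elif link not in self.visited:
                # internal link: follow it, keep the domain tag, and remember
                # it so the same URL is not requested twice
                self.visited.add(link)
                yield Request(link, meta={'domain': domain}, callback=self.parse_item)

        yield item
```

Note that Scrapy's default duplicate filter already drops requests for URLs it has seen before, so the explicit `visited` set is mostly there to make the intent obvious.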
, compare the domain with the extracted links and sent out new requests ourselves.You would probably need to keep a data structure (ex a hashmap) of URLs that the crawler has already visited. Then it's just a matter of adding URLs to the hashmap as you visit them and not visiting URLs if they're in the hashmap already (as this means you have already visited them). There are probably more complicated ways of doing this which would give you greater performace, but these would also be harder to implement.