I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl each full domain, extract all the h1, h2 and h3 headings, and create one record per domain containing the domain name plus a single string with all the h1/h2/h3 text from that domain. Basically, a Domain item holding one large string with all the headers.
I would like the output to be DOMAIN, STRING(h1, h2, h3) - collected from all the URLs on that domain.
The problem I have is that each URL ends up in a separate item. I know I haven't gotten very far, but a hint in the right direction would be very much appreciated. Basically: how do I create an outer loop so that the yield statement keeps accumulating until the next domain comes up?
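For example, for a hypothetical domain example.com the end result should be a single record along these lines (values made up):

    {'domain': 'example.com',
     'headers': 'Welcome About us Our services Contact ...'}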
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from Autotask_Prospecting.items import AutotaskProspectingItem
from Autotask_Prospecting.items import WebsiteItem
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from nltk import clean_html
class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [l.strip() for l in open('Domains.txt').readlines()]
    start_urls = [l.strip() for l in open('start_urls.txt').readlines()]

    rules = (
        # Follow every link inside the allowed domains and parse each page
        # with parse_item. follow=True is needed here: when a callback is
        # given, links are not followed further by default.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        # One item per crawled page, holding that page's h1/h2/h3 text.
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1', "//h1/text()")
        loader.add_xpath('h2', "//h2/text()")
        loader.add_xpath('h3', "//h3/text()")
        yield loader.load_item()
This cannot be done in the spider itself: requests are handled in parallel, and there is no way to make the crawl work through one domain at a time. What you can do is write an item pipeline that accumulates the headers per domain and emits the combined structure when the spider closes (the close_spider hook, driven by the spider_closed signal), something like:
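The original answer's code is not shown here, so below is only a minimal sketch of that idea, written against the same scrapy.contrib-era API as the code above. It assumes parse_item also stores the page URL on the item (for example via loader.add_value('url', response.url)) so the pipeline can tell which domain an item came from; the field names and the output file name are illustrative.

    from collections import defaultdict
    from urlparse import urlparse  # Python 2, matching the code above
    import json

    class DomainHeadersPipeline(object):
        """Accumulate h1/h2/h3 text per domain; emit one record per domain at the end."""

        def open_spider(self, spider):
            self.headers_by_domain = defaultdict(list)

        def process_item(self, item, spider):
            # Group this page's headings under its domain.
            domain = urlparse(item['url']).netloc
            for field in ('h1', 'h2', 'h3'):
                self.headers_by_domain[domain].extend(item.get(field, []))
            return item

        def close_spider(self, spider):
            # Called once when the spider finishes: write DOMAIN, STRING(h1, h2, h3).
            with open('domain_headers.jl', 'w') as out:
                for domain, headers in self.headers_by_domain.items():
                    record = {'domain': domain, 'headers': ' '.join(headers)}
                    out.write(json.dumps(record) + '\n')

Usage: enable the pipeline in settings.py (the module path below assumes it lives in Autotask_Prospecting/pipelines.py; on very old Scrapy versions ITEM_PIPELINES is a plain list of class paths instead of a dict):

    ITEM_PIPELINES = {
        'Autotask_Prospecting.pipelines.DomainHeadersPipeline': 300,
    }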