Crawl a full domain and load all h1 into one item

Posted 2019-08-28 05:46

I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl the full domain, extract all the h1, h2 and h3 headings, and create one record per domain that contains the domain name plus a single string with all the h1/h2/h3 text from that domain. Basically, a domain item with one large string containing all the headers.

I would like the output to be DOMAIN, STRING(h1,h2,h3) - collected from all the URLs on that domain.

The problem I have is that each URL ends up in a separate item. I know I haven't gotten very far, but a hint in the right direction would be much appreciated. Basically, how do I create an outer loop so that the yield statement keeps going until the next domain comes up?
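For reference, a minimal sketch of what my WebsiteItem looks like (the field names are assumed from the loader calls in the spider below):

from scrapy.item import Item, Field

class WebsiteItem(Item):
    # one field per heading level, filled by the item loader
    h1 = Field()
    h2 = Field()
    h3 = Field()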

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from Autotask_Prospecting.items import WebsiteItem


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [ l.strip() for l in open('Domains.txt').readlines() ]
    start_urls = [ l.strip() for l in open('start_urls.txt').readlines() ]


    rules = (
        # Follow every link and parse each page with parse_item.
        # With a callback set, follow defaults to False, so it must be
        # enabled explicitly or the crawl stops at the start pages.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        # Load all heading text on this page into one item.
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1', '//h1/text()')
        loader.add_xpath('h2', '//h2/text()')
        loader.add_xpath('h3', '//h3/text()')
        yield loader.load_item()

Tags: python scrapy
1 answer
对你真心纯属浪费
#2 · 2019-08-28 06:29

"an outer loop so that the yield statement keeps going until the next domain comes up"

This cannot be done: requests are handled in parallel, and there is no way to make the domains be crawled serially, one after another.

What you can do is write a pipeline that accumulates the headers per domain and emits the combined items when the spider closes, something like:

from scrapy.item import Item, Field

# this assumes your item looks roughly like the following
class MyItem(Item):
    domain = Field()
    hs = Field()


import collections

class DomainPipeline(object):

    def __init__(self):
        # collect every header seen for each domain
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        # emit one combined item per domain once the crawl finishes
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
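For this to take effect, the pipeline also has to be enabled in the project's settings.py; a minimal sketch, assuming the module lives in the Autotask_Prospecting package used in the question:

# settings.py -- register the pipeline (module path assumed)
ITEM_PIPELINES = {
    'Autotask_Prospecting.pipelines.DomainPipeline': 300,
}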

usage:

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>> 
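If you want the headers as one big string per domain (the DOMAIN, STRING(h1,h2,h3) output described in the question) rather than a set, join the accumulated set when emitting the final item, for example:

>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=' '.join(sorted(hs)))
... 
{'domain': 'google.com', 'hs': '0 1 2 3 4 5 6 7 8 9'}

Note that for the grouping to work, parse_item in the spider also has to fill in a domain field on each item, e.g. with loader.add_value('domain', urlparse(response.url).netloc) (field name and helper assumed, not taken from the question's code).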