How to get all pages from a whole website using scrapy

Published 2019-08-03 03:21

Question:

I am trying to make a tool that gets every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using scrapy:

from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)

Here I only get the questions from the start page. What do I need to do to get all the 'question' links? Time doesn't matter, I just need to understand what to do.

UPD

The site I want to crawl is https://sevastopol.su/ - a local city news website.

The list of all news should be contained here: https://sevastopol.su/all-news

At the bottom of this page you can see page numbers, but if we go to the last page of news we see that it is number 765 (right now, 19.06.2019), yet its last news item is dated 19 June 2018. So the pagination only goes back one year. However, there are plenty of older news links (probably going back to 2010) that are still alive and can even be found through the site's search page. That is why I wanted to know whether there is access to some global link store on this site.

Answer 1:

This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.

Run the script just the way it is:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        # yield the link of every question listed on the current page
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        # follow the "next" pagination link, if there is one
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
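            
Since the answer expects occasional 404s, here is a hedged sketch of the same spider with an errback attached to each pagination request. errback is a standard scrapy.Request argument; the handler name log_failure is my own choice, not part of the original answer:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield {"link": response.urljoin(link)}

        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
                errback=self.log_failure,  # called when the request fails (404, DNS error, timeout, ...)
            )

    def log_failure(self, failure):
        # failure is a twisted Failure; log the failing URL and keep crawling
        self.logger.warning("Request failed: %s", failure.request.url)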


Answer 2:

You should write a regular expression (or a similar search function) that looks for <a> tags with a specific class (in the case of Stack Overflow: class="question-hyperlink") and take the href attribute from those elements. This fetches all the question links from the current page.
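
A minimal sketch of that idea, using the requests library for the download and assuming the question list still uses the class question-hyperlink; an HTML parser (lxml, BeautifulSoup, or scrapy selectors) would be more robust than a regular expression, but this mirrors the approach described above:

import re
import requests

html = requests.get("https://stackoverflow.com/questions/").text

# crude two-step search: grab every <a ...> tag that carries the
# question-hyperlink class, then pull out its href attribute
links = []
for tag in re.findall(r'<a[^>]*class="[^"]*question-hyperlink[^"]*"[^>]*>', html):
    match = re.search(r'href="([^"]+)"', tag)
    if match:
        # hrefs on the question list are site-relative, so prefix the domain (an assumption)
        links.append("https://stackoverflow.com" + match.group(1))

print(links)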

Then you can also search for the page links (at the bottom). There you see that those links look like /questions?sort=active&page=<pagenumber>, where you can replace <pagenumber> with the page you want to scrape (e.g. make a loop that starts at 1 and goes on until you get a 404 error), as in the sketch below.
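
A hedged sketch of such a loop, again with the requests library; stopping on a 404 status is the answer's suggested rule, not something the site guarantees:

import requests

page = 1
while True:
    url = f"https://stackoverflow.com/questions?sort=active&page={page}"
    response = requests.get(url)

    if response.status_code == 404:
        break  # no more pages, per the answer's stopping rule

    # hand response.text to the link-extraction step sketched above
    print(f"fetched page {page}: {len(response.text)} bytes")
    page += 1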



Answer 3:

Your spider, which now yields requests to crawl the subsequent pagination pages:

from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # This method is called for every successfully crawled page

        # get all pagination links using xpath
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            # build the absolute url 
            url = urljoin('https://sevastopol.su/', link)
            print(url)
            yield Request(url=url, callback=self.parse)  # <-- this makes your spider recursively crawl subsequent pages

Note that you don't have to worry about requesting the same URL multiple times: duplicates are dropped by Scrapy's default settings.
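
To also collect the news links themselves from every crawled page (the original goal in the question), a hedged extension of this spider could reuse the LinkExtractor from the question. The pager XPath is taken from the answer above; yielding every extracted link as an item is my own sketch, and a stricter LinkExtractor(allow=...) pattern would be needed to keep only news URLs:

from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # yield every link found on the page as an item (as in the question's code)
        for link in LinkExtractor().extract_links(response):
            yield {"link": link.url}

        # follow the pagination links so the same extraction runs on every page
        for href in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            yield response.follow(href, callback=self.parse)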

Next steps:

  • Configure Scrapy (e.g. User-Agent, crawl delay, ...): https://docs.scrapy.org/en/latest/topics/settings.html

  • Handle errors (errback): https://docs.scrapy.org/en/latest/topics/request-response.html

  • Use Item Pipelines to store your URLs etc. (a minimal sketch follows below): https://docs.scrapy.org/en/latest/topics/item-pipeline.html
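
As a starting point for the last item, here is a minimal sketch of a pipeline that appends every scraped URL to a text file; the class name UrlFilePipeline and the file name urls.txt are my own choices, not anything from the answer:

# pipelines.py - minimal sketch of a pipeline that writes scraped links to a file
class UrlFilePipeline:
    def open_spider(self, spider):
        # called once when the spider starts; open the output file
        self.file = open('urls.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields
        if 'link' in item:
            self.file.write(item['link'] + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

Enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.UrlFilePipeline': 300} (the module path depends on your project layout).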