Scrapy get all links from any website

Posted 2020-07-13 13:10

Question:

I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):

    return_links = []

    r = requests.get(link)

    soup = BeautifulSoup(r.content, "lxml")

    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        # keep only absolute http(s) links from anchor tags
        for a_tag in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(a_tag.get('href'))

    return return_links

def recursive_search(links):
    for i in links:
        links.extend(get_links(i))  # the list grows while it is being iterated
    recursive_search(links)


recursive_search(get_links("https://www.brandonskerritt.github.io"))

The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.

I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
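
(For reference, robots.txt handling in Scrapy is a single setting; a minimal settings.py sketch, where the user agent string is just a placeholder of mine:)

# settings.py
ROBOTSTXT_OBEY = True    # fetch and respect each site's robots.txt
USER_AGENT = 'demo-crawler (+https://www.brandonskerritt.github.io)'  # identify the crawler
DOWNLOAD_DELAY = 1       # optional: throttle requests to be polite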

This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, the links might be in list tags). The code I have above works on all anchor tags, regardless of which website it's run on.

I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.

Any help will be appreciated!

Answer 1:

If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

A simple spider that follows all links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'

    start_urls = ['https://example.com']
    # no allowed_domains, and a bare LinkExtractor: every link found is followed
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
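
You can run it without a full project by putting the class in a script and using Scrapy's CrawlerProcess (my sketch, not part of the original answer):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'ROBOTSTXT_OBEY': True})  # honour robots.txt, as the question asked
process.crawl(FollowAllSpider)
process.start()  # blocks until the crawl finishes or is interrupted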


Answer 2:

There is an entire section of the Scrapy documentation dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
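
For example, those recommendations mostly come down to project settings along these lines (typical values from that guide, a sketch rather than a prescription):

# settings.py -- broad-crawl tweaks
CONCURRENT_REQUESTS = 100          # crawl many sites in parallel
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # DEBUG logging is too noisy at this scale
COOKIES_ENABLED = False            # a broad crawl rarely needs session state
RETRY_ENABLED = False              # don't spend time re-requesting failures
DOWNLOAD_TIMEOUT = 15              # give up on slow sites quickly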

To recreate the behaviour you need in Scrapy, you must

  • set your start URL to your page;
  • write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.

An untested example (which can, of course, be refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []  # every URL the spider has visited

    def parse(self, response):
        self.links.append(response.url)
        # follow every anchor href; response.follow resolves relative URLs
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
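
If you want to see what was collected, one option (my addition, not from the original answer) is to give the spider above Scrapy's closed() hook, which runs when the crawl stops:

    def closed(self, reason):
        # called once the spider finishes; report the collected URLs
        self.logger.info("Visited %d pages", len(self.links))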