I am trying to make a tool that gets every link from a website. For example, I need to get all question pages from Stack Overflow.
I tried using Scrapy:
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)
Here I only got the questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just need to understand what to do.
UPD
The site I want to crawl is https://sevastopol.su/ - a local city news website.
The list of all news should be contained here: https://sevastopol.su/all-news
At the bottom of this page you can see page numbers, but if we go to the last page of news we see that it is number 765 (right now, 19.06.2019), yet it shows the last news item with a date of 19 June 2018. So the pagination only covers the last year of news. But there are also plenty of older news links that are still alive (probably going back to 2010) and can even be found through the site's search page.
That is why I wanted to know whether there is a way to access some global link store of this site.
This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.
Run the script just the way it is:
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
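If you save this in a single file (the file name below is just an example), you can run it without creating a full Scrapy project and export the collected links to JSON:

    scrapy runspider stackoverflow_spider.py -o question_links.json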
You should write a regular expression (or a similar search function) that looks for <a> tags with a specific class (in the case of SO: class="question-hyperlink") and take the href attribute from those elements. This will fetch all the links from the current page.
Then you can also search for the page links (at the bottom). There you can see that those links are /questions?sort=active&page=<pagenumber>, where you can change <pagenumber> to the page you want to scrape (e.g. make a loop that starts at 1 and goes on until you get a 404 error).
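A rough sketch of that idea using requests and a regular expression follows. It assumes Stack Overflow's current markup (the question-hyperlink class) and does not handle rate limiting, so treat the pattern as an assumption rather than something guaranteed to keep working:

    import re
    import requests

    BASE = "https://stackoverflow.com"
    # find <a ...> opening tags that carry the question-hyperlink class, then pull their href;
    # the markup may change, so this pattern is an assumption
    ANCHOR_RE = re.compile(r'<a\b[^>]*class="question-hyperlink"[^>]*>', re.IGNORECASE)
    HREF_RE = re.compile(r'href="([^"]+)"')

    page = 1
    while True:
        resp = requests.get(f"{BASE}/questions",
                            params={"sort": "active", "page": page})
        if resp.status_code == 404:  # past the last page -> stop
            break
        resp.raise_for_status()
        for anchor in ANCHOR_RE.findall(resp.text):
            match = HREF_RE.search(anchor)
            if match:
                print(BASE + match.group(1))
        page += 1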
Your spider, which now yields requests to crawl subsequent pages:
from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # This method is called for every successfully crawled page

        # get all pagination links using xpath
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            # build the absolute url
            url = urljoin('https://sevastopol.su/', link)
            print(url)
            yield Request(url=url, callback=self.parse)  # <-- makes your spider recursively crawl subsequent pages
Note that you don't have to worry about requesting the same URL multiple times: duplicates are dropped by Scrapy (default settings).
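If you also want to collect the news article URLs themselves (not just walk the pagination), you can add a second extraction step inside parse(). This is a sketch of a replacement for the parse() method above; the article XPath is only an assumption about sevastopol.su's markup and will likely need adjusting:

    def parse(self, response):
        # yield every article link found on the page
        # (the '/news/' href pattern is an assumption - check the real markup)
        for article_link in response.xpath("//a[contains(@href, '/news/')]/@href").getall():
            yield {"news_url": response.urljoin(article_link)}

        # keep following the pagination as before
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            yield Request(url=response.urljoin(link), callback=self.parse)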
Next steps (a short combined sketch follows this list):
Configure Scrapy (e.g. User Agent, Crawl Delay, ...): https://docs.scrapy.org/en/latest/topics/settings.html
Handle errors (errback): https://docs.scrapy.org/en/latest/topics/request-response.html
Use Item Pipelines to store your URLs etc.: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
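To illustrate those three points, here is a minimal, hypothetical sketch. The spider name, user agent string, delay, output file name, and pipeline name are all example values, not recommendations:

    import scrapy


    class PoliteSpider(scrapy.Spider):
        # hypothetical example spider illustrating the three next steps
        name = 'polite_links'
        start_urls = ['https://sevastopol.su/all-news']

        # 1) settings: identify your crawler and slow it down (example values)
        custom_settings = {
            'USER_AGENT': 'my-link-collector (+mailto:you@example.com)',
            'DOWNLOAD_DELAY': 1.0,
        }

        def parse(self, response):
            # hand the current page URL to the item pipeline
            yield {'link': response.url}
            for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
                # 2) errback: get notified about failed requests instead of losing them silently
                yield scrapy.Request(response.urljoin(link),
                                     callback=self.parse,
                                     errback=self.on_error)

        def on_error(self, failure):
            self.logger.warning('Request failed: %r', failure.request.url)


    # 3) pipelines.py: a minimal pipeline that appends every scraped URL to a file;
    # enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.UrlWriterPipeline': 300}
    class UrlWriterPipeline:
        def open_spider(self, spider):
            self.file = open('urls.txt', 'a', encoding='utf-8')

        def process_item(self, item, spider):
            self.file.write(item['link'] + '\n')
            return item

        def close_spider(self, spider):
            self.file.close()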