I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function to work the way I would like.
The code below visits the start_url and finds all the "a" tags on the site. For each one of them it makes a callback, which should save the text response to disk and use crawlerItem to store some metadata about the page.
I was hoping someone could help me figure out how to:
- pass a unique id to each callback so it can be used as the filename when saving the file
- pass the URL of the originating page so it can be added to the metadata via the items
- follow the links on the child pages to go another level deeper into the site
Below is my code thus far:
import scrapy
from bs4 import BeautifulSoup
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

from mycrawler.items import LibrarycrawlerItem


class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()

        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = [response.url]  # FilesPipeline expects a list of URLs
        yield item
In settings.py:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# Raw string so the backslashes are not treated as escape sequences
FILES_STORE = r'C:\Documents\Spider\crawler\ExtractedText'
In items.py:
import scrapy


class LibrarycrawlerItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()  # the URLs the FilesPipeline should download
    files = scrapy.Field()      # filled in by the FilesPipeline after download
I'm not 100% sure, but I don't think you can name the files Scrapy downloads however you want; by default Scrapy names them itself, using a hash of the file's URL.
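If you do need control over the names, one common workaround is to subclass the FilesPipeline and override its file_path method. A minimal sketch, assuming a recent Scrapy version (the file_path signature differs in older releases, and file_id is a hypothetical item field you would have to fill in yourself):

from scrapy.pipelines.files import FilesPipeline


class RenamingFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # 'file_id' is a hypothetical field set by the spider; fall back
        # to Scrapy's default hash-based name when it is missing.
        if item is not None and item.get('file_id'):
            return f"full/{item['file_id']}.txt"
        return super().file_path(request, response=response, info=info, item=item)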
What you want to do looks like a job for CrawlSpider instead of Spider. CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs: https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider
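A rough sketch of that approach (note that file_id and origin_url are my additions for the unique id and the originating page, so they would also need to be declared in items.py; the Referer header is only present because Scrapy's default referer middleware sets it on followed requests):

import uuid

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from mycrawler.items import LibrarycrawlerItem


class LibraryCrawlSpider(CrawlSpider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    # follow=True tells CrawlSpider to keep extracting links from every
    # scraped page, so the crawl descends level by level on its own.
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        item = LibrarycrawlerItem()
        item['title'] = response.css('title::text').get()
        # A unique id that can double as the filename when saving to disk.
        item['file_id'] = str(uuid.uuid4())
        # The page this request came from (set by the referer middleware).
        item['origin_url'] = response.request.headers.get('Referer', b'').decode()
        item['file_urls'] = [response.url]
        yield item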
If you are stubborn enough to keep Spider, you can use the meta attribute on requests to pass the items along and save the originating links in them. To get the item back, just look for it in response.meta.
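A minimal sketch of that approach (parse_child and the origin_url meta key are illustrative names, not Scrapy built-ins):

import scrapy

from mycrawler.items import LibrarycrawlerItem


class LibrarySpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            item = LibrarycrawlerItem()
            item['title'] = response.css('title::text').get()
            # Stash the item and the originating URL on the request's meta
            # so the child callback can pick them up again.
            yield response.follow(
                href,
                callback=self.parse_child,
                meta={'item': item, 'origin_url': response.url},
            )

    def parse_child(self, response):
        # The item travels with the request and comes back on the response.
        item = response.meta['item']
        item['file_urls'] = [response.url]
        yield item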
Note that there is no need to store the page's own URL in the item before making the request; you can always get the current URL in the callback with response.url.
Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's selectors are well developed to the extent that you don't need anything else to extract data.
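For example, the title lookup in your callback needs no BeautifulSoup at all:

def scrape_page(self, response):
    # Equivalent to BeautifulSoup's page_soup.title.get_text(),
    # using Scrapy's built-in CSS selectors instead.
    ScrapedPageTitle = response.css('title::text').get()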