Pass values into scrapy callback

Posted 2019-07-25 13:19

I'm trying to get started with crawling and scraping a website to disk, but I'm having trouble getting the callback function to work the way I'd like.

The code below will visit the start_url and find all the "a" tags on the site. For each one of them it will issue a callback, which should save the text response to disk and use a crawlerItem to store some metadata about the page.

I was hoping someone could help me figure out how to:

  1. Pass a unique id to each callback so it can be used as the filename when saving the file
  2. Pass the url of the originating page so it can be added to the metadata via the Items
  3. Follow the links on the child pages to go another level deeper into the site

Below is my code thus far

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

from mycrawler.items import LibrarycrawlerItem


class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url

        yield item

In settings.py

ITEM_PIPELINES = {
    'librarycrawler.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\Documents\Spider\crawler\ExtractedText'

In items.py

import scrapy


class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()

1 Answer

The star · answered 2019-07-25 13:41

I'm not 100% sure, but I think you can't rename the files the Scrapy files/images pipeline downloads however you want; Scrapy chooses the names itself (by default, a hash of the file's URL).
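That said, if the filename really does matter, one option worth checking against the FilesPipeline docs is subclassing it and overriding file_path, which decides where each downloaded file is stored. The pipeline class and naming scheme below are only a sketch:

    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline


    class NamedFilesPipeline(FilesPipeline):
        """Sketch of a FilesPipeline that names saved files after the page URL
        instead of Scrapy's default hash-based name."""

        def file_path(self, request, response=None, info=None, *, item=None):
            # Hypothetical naming scheme: last URL path segment, or 'index'
            # for the site root; adjust to whatever unique id you actually have.
            name = urlparse(request.url).path.strip('/').replace('/', '_') or 'index'
            return f'full/{name}.html'

It would be registered in ITEM_PIPELINES in place of the stock FilesPipeline, under whatever module path your project uses.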

What you want to do looks like a job for CrawlSpider instead of Spider.

CrawlSpider follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
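For illustration, here is a minimal sketch of the spider from the question rewritten as a CrawlSpider, reusing its names. Note that FilesPipeline also expects the item to declare file_urls and files fields, which the items.py above does not yet have:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    from mycrawler.items import LibrarycrawlerItem


    class LibraryCrawlerSpider(CrawlSpider):  # renamed so it doesn't shadow scrapy's CrawlSpider
        name = "librarycrawler"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com"]

        # Follow every link found on every page, recursively, and hand each
        # response to parse_page. Don't call the callback "parse":
        # CrawlSpider uses that name internally.
        rules = (
            Rule(LinkExtractor(), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            item = LibrarycrawlerItem()
            item["title"] = response.css("title::text").get()
            # FilesPipeline reads a *list* of URLs from file_urls.
            item["file_urls"] = [response.url]
            yield item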

If you are stubborn enough to keep Spider, you can use the meta dict on requests to pass the items along and store the originating links in them.

for link in soup.find_all("a"):
    item = LibrarycrawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request

To get the item back, just look it up in the response's meta:

def scrape_page(self, response):
    item = response.meta['item']

In this specific example, passing item['url'] is redundant, since you can get the current url from response.url.
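Putting the two pieces together, the callback could look roughly like this (a sketch, assuming the item declares title, url and file_urls fields):

    def scrape_page(self, response):
        item = response.meta['item']                 # created on the parent page
        item['title'] = response.css('title::text').get()
        item['file_urls'] = [response.url]           # FilesPipeline expects a list
        yield item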

Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's own selectors are well developed enough that you don't need anything else to extract data.
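For example, the two BeautifulSoup calls in the question have direct selector equivalents, usable in any callback with no extra imports:

    # Instead of BeautifulSoup(response.body, "html.parser"):
    title = response.css("title::text").get()        # == page_soup.title.get_text()
    hrefs = response.css("a::attr(href)").getall()   # == [a.get("href") for a in soup.find_all("a")]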
