Scrapy: enabling files pipeline for absolute and relative paths

Posted 2019-06-10 16:54

Question: What am I missing in my code (see the "Current Code" section below) that would enable me to download files from both absolute and relative paths using Scrapy? I'd appreciate the help; I'm feeling lost on how all of these components work together and how to get the desired behavior.

Background: I've pored over the Scrapy docs, studied comparable examples on GitHub, and trawled StackOverflow for answers, but I can't get the Scrapy files pipeline to work the way I would like. I am looking at fairly basic target websites that have a number of files, primarily PDFs and JPGs, linked as absolute or relative paths under the a href or img src selectors. I want to download all of those files. My understanding is that response.follow will follow both relative and absolute paths, but I'm not sure whether it will always yield a path that can be downloaded via the files pipeline. I did figure out how to crawl both absolute and relative paths, thanks to answers to an earlier question of mine.
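
For example, my mental model is that response.follow resolves links the same way Python's urljoin does (a quick standalone sketch with made-up paths, just to illustrate what I expect):

from urllib.parse import urljoin

base = 'http://www.example.com/docs/'
# A relative path should be resolved against the page URL...
print(urljoin(base, 'files/report.pdf'))  # http://www.example.com/docs/files/report.pdf
# ...while an absolute URL should pass through unchanged.
print(urljoin(base, 'http://www.example.com/report.pdf'))  # http://www.example.com/report.pdf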

Problems Experienced: There are two primary problems. First, I can't seem to get the spider to follow both absolute and relative paths. Second, I can't seem to get the files pipeline to actually download files. This is most likely a function of me not understanding how the four .py files work together. If someone could offer some basic observations and guidance, I'm sure I can move past this basic go/no-go point and start layering in more sophisticated functionality.

Current Code: here are the relevant contents from myspider.py, items.py, pipelines.py, and settings.py.

myspider.py: note that the parse_items function is incomplete, because I do not understand what the function should include.

from scrapy import Spider
from ..items import MyspiderItem

# Using response.follow for different xpaths
class MySpider(Spider):
    name='myspider'
    allowed_domains=['example.com']
    start_urls=['http://www.example.com/']

    # Standard link extractor           
    def parse_all(self, response):
        # follow <a href> selector
        for href in response.xpath('//a/@href'):
            yield response.follow(href, self.parse_items)

        # follow <img src> selector
        for img in response.xpath('//img/@src'):
            yield response.follow(img, self.parse_items)

    # This is where I get lost
    def parse_items(self, response):
        # trying to define item for items pipeline
        MyspiderItem.item['file_urls']=[]

items.py

import scrapy

class MyspiderItem(scrapy.Item):
    file_urls=scrapy.Field()
    files=scrapy.Field()

settings.py: here is the relevant section enabling the files pipeline.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/home/me/Scraping/myspider/Downloads'

pipelines.py:

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        return item

1 Answer
劳资没心，怎么记你
answered 2019-06-10 17:26

I think something is wrong in your spider, myspider.py.

parse_all() is probably the wrong name: you have not defined start_requests() in your spider to point its callback at parse_all(), so Scrapy will only call parse() by default.

You should rename parse_all() to parse().
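
Here is a minimal sketch of how the spider could look after that rename, with parse_items() yielding an actual item instance whose file_urls field the enabled FilesPipeline consumes. It reuses the names from your code; note that it will also treat HTML pages as files unless you filter the links, for example with the extension trick below:

from scrapy import Spider
from ..items import MyspiderItem

class MySpider(Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Renamed from parse_all: with start_urls and no start_requests(),
    # Scrapy uses parse() as the default callback.
    def parse(self, response):
        # response.follow resolves both relative and absolute links
        for href in response.xpath('//a/@href'):
            yield response.follow(href, self.parse_items)
        for img in response.xpath('//img/@src'):
            yield response.follow(img, self.parse_items)

    # Build an item instance (not the class itself) and hand the URL
    # to the FilesPipeline through the file_urls field; the pipeline
    # then downloads it into FILES_STORE.
    def parse_items(self, response):
        item = MyspiderItem()
        item['file_urls'] = [response.url]
        yield item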

On the absolute/relative path issue, there is a trick: look at the link itself. If it includes a scheme and domain (in the form http://domain/...), it is an absolute link. For a relative path, you can manually join it with the page's base URL and then process your download.
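
A sketch of that check, with a hypothetical helper name of my own; response.urljoin is Scrapy's built-in way to do the join, and response.follow does the same resolution internally:

from urllib.parse import urlparse

def to_absolute(response, link):
    # An absolute link already carries a scheme and domain
    # (http://domain/...); a relative one does not.
    if urlparse(link).netloc:
        return link
    # Join the relative path against the page URL before downloading.
    return response.urljoin(link)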

Another trick for detecting whether a link points to a downloadable file: such links usually end with a file extension, e.g. .pdf, .jpg, and so on.
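
For example (a minimal sketch; the extension list and helper name are my own):

# Hypothetical extension whitelist; extend it as needed.
FILE_EXTENSIONS = ('.pdf', '.jpg', '.jpeg', '.png')

def looks_like_file(link):
    # Treat a link as downloadable when it ends with a known extension.
    return link.lower().endswith(FILE_EXTENSIONS)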
