Question: What am I missing in my code (see the "Current Code" section below) that would enable me to download files from both absolute and relative paths using Scrapy? I appreciate the help; I'm feeling lost about how all of these components fit together and how to get the desired behavior.
Background: I've pored over the Scrapy docs, studied comparable examples on GitHub, and trawled StackOverflow for answers, but I can't get the Scrapy files pipeline to work the way I would like. I am looking at fairly basic target websites with a number of files, primarily PDFs and JPGs, linked as absolute or relative paths in a href and img src attributes. I want to download all of those files. My understanding is that response.follow will follow both relative and absolute paths, but I'm not sure whether it will always yield a URL that can be downloaded via the files pipeline. I did figure out how to crawl absolute and relative paths, thanks to answers offered to an earlier question of mine.
Problems Experienced: There are two primary problems. First, I can't seem to get the spider to follow both absolute and relative paths. Second, I can't seem to get the files pipeline to actually download files. This is most likely a function of me not understanding how the four .py files work together. If someone could offer some basic observations and guidance, I'm sure I can move past this basic go/no-go point and start layering in more sophisticated functionality.
Current Code: here are the relevant contents from myspider.py, items.py, pipelines.py, and settings.py.
myspider.py: note that the parse_items function is incomplete; I do not understand what the function should include.
from scrapy import Spider
from ..items import MyspiderItem

# Using response.follow for different xpaths
class MySpider(Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Standard link extractor
    def parse_all(self, response):
        # follow <a href> selector
        for href in response.xpath('//a/@href'):
            yield response.follow(href, self.parse_items)
        # follow <img src> selector
        for img in response.xpath('//img/@src'):
            yield response.follow(img, self.parse_items)

    # This is where I get lost
    def parse_items(self, response):
        # trying to define item for items pipeline
        MyspiderItem.item['file_urls'] = []
items.py:
import scrapy

class MyspiderItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
settings.py: here is the relevant section enabling the files pipeline.
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/home/me/Scraping/myspider/Downloads'
pipelines.py:
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        return item
I think something is wrong in your spider, myspider.py. parse_all() might be the wrong name: since you have not defined start_requests() in your spider and pointed its callback to parse_all(), Scrapy will only call parse() by default. I think you should rename parse_all() to parse().
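Here is a minimal sketch of what that change could look like, together with one possible parse_items() that hands the followed URL to the files pipeline (building the item from response.url is my assumption about what you intended, not the only option):

from scrapy import Spider
from ..items import MyspiderItem

class MySpider(Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Renamed from parse_all: without start_requests(), Scrapy uses
    # parse() as the default callback for the URLs in start_urls
    def parse(self, response):
        # response.follow resolves relative and absolute links alike
        for href in response.xpath('//a/@href'):
            yield response.follow(href, self.parse_items)
        for src in response.xpath('//img/@src'):
            yield response.follow(src, self.parse_items)

    def parse_items(self, response):
        # response.url is already absolute here, so the files
        # pipeline can download it directly
        item = MyspiderItem()
        item['file_urls'] = [response.url]
        yield item

Note that this fetches each linked file once to reach parse_items and then again inside the pipeline; you could also build file_urls directly in parse() with response.urljoin() and skip the extra request.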
For the issue of absolute vs. relative paths, there is a trick: look at the asset path of the site. If a link includes that path (typically in the form http://domain/...), it is an absolute link. With relative paths, you can manually join them with the base URL and then process your download. Another trick for detecting whether a link points to a downloadable file: file URLs usually end with an extension, e.g. .pdf or .jpg.
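A rough sketch of both tricks, assuming the extensions you care about are .pdf and .jpg (adjust the list for your sites):

from urllib.parse import urlparse

FILE_EXTENSIONS = ('.pdf', '.jpg')  # example list, extend as needed

def make_absolute(response, link):
    # Links that already carry a scheme (http://domain/...) are
    # absolute; relative ones get the page's base URL prepended
    if urlparse(link).scheme:
        return link
    return response.urljoin(link)

def looks_like_file(url):
    # Downloadable files usually end with an extension
    return urlparse(url).path.lower().endswith(FILE_EXTENSIONS)

You could call make_absolute() on every extracted href/src and only yield items whose URL passes looks_like_file().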