For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
How can I store the files using my custom file names instead?
In the example above, I would like the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so that I keep uniqueness but the original file name stays visible.
As a starting point, I experimented with the pipelines.py of my project, without much success:
import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        # Use the name passed along in the request meta as the stored path
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        # Carry the item's name to file_path via the request meta
        meta = {'filename': item['name']}
        yield scrapy.Request(url=file_url, meta=meta)
I also included this setting in my settings.py:
ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 300,
    'io_spider.pipelines.MyFilesPipeline': 200,
}
A similar question has been asked, but it targets images rather than files.
Any help will be appreciated.
file_path should return the path to your file. In your code, file_path returns item['name'], and that will be your file's path. Note that by default file_path calculates SHA1 hashes. So try a file_path method something like this (note: this is untested code):
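A minimal sketch of such an override, under a few assumptions: the helper name readable_file_name is hypothetical, the readable part comes from request.meta['filename'] when the spider supplies one and from the last URL segment otherwise, and the file_path signature with the keyword-only item argument matches recent Scrapy releases (older releases omit it). The guarded import is there only so the helper can be tried without Scrapy installed.

```python
import hashlib
import os
from urllib.parse import urlparse

try:
    from scrapy.pipelines.files import FilesPipeline
except ImportError:
    # Fallback so the helper below can be tried without Scrapy installed
    FilesPipeline = object


def readable_file_name(url, name=None):
    """Return '<name>_<SHA1 of url><extension>' (hypothetical helper).

    If no name is given, the last path segment of the URL is used.
    """
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
    stem, ext = os.path.splitext(os.path.basename(urlparse(url).path))
    return f"{name or stem}_{url_hash}{ext}"


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # 'filename' is assumed to be set in get_media_requests via request.meta
        return readable_file_name(request.url, request.meta.get("filename"))
```

The returned path is relative to FILES_STORE. Unlike the default, this sketch does not place files under full/, so prepend a directory name if you want one.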