I am using Scrapy to crawl a website and download some files. Since the file_url I get redirects to another URL (302 redirect), I use another method, handle_redirect, to resolve the redirected URL, and I customize the FilesPipeline like this:
import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    def handle_redirect(self, file_url):
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item
With the code above I can download the files, but the process blocks while downloading, so the whole project becomes very slow.
I tried another solution in the spider: use a Request to get the redirected URL first, then pass it to another callback, and use the default FilesPipeline.
yield scrapy.Request(
    download_url[0],
    meta={
        "name": name,
    },
    dont_filter=True,
    callback=self.handle_redirect)

def handle_redirect(self, response):
    logging.warning("respon %s" % response.meta)
    download_url = response.headers["Location"].decode("utf-8")
    return AppListItem(
        name=response.meta["name"],
        file_urls=[download_url],
    )
This still blocks the process.
From the docs here:
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or fail for some reason)
Does this mean I can't crawl the next URL until the files have been downloaded? (I don't set download_delay in my settings.)
EDIT
I had already added this from the start:
handle_httpstatus_list = [302]
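For context, here is a minimal sketch of how that attribute and the callback above sit in my spider; the spider name, start URL and selectors are placeholders, and a plain dict stands in for my AppListItem:
import scrapy

class MySpider(scrapy.Spider):
    # Sketch only: the spider name, start URL and selectors are placeholders.
    name = "my_spider"
    start_urls = ["http://example.com/list"]
    handle_httpstatus_list = [302]  # keep 302 responses instead of auto-following them

    def parse(self, response):
        download_url = response.css("a.download::attr(href)").extract_first()
        name = response.css("h1::text").extract_first()
        yield scrapy.Request(
            download_url,
            meta={"name": name},
            dont_filter=True,
            callback=self.handle_redirect,
        )

    def handle_redirect(self, response):
        # the raw 302 reaches this callback because of handle_httpstatus_list
        download_url = response.headers["Location"].decode("utf-8")
        # a plain dict stands in for the AppListItem used in the question
        return {"name": response.meta["name"], "file_urls": [download_url]}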
With that attribute set, I am not redirected to the redirected URL automatically. My first solution used requests because I thought yield would work like this:
- I crawl a page, keep yielding callbacks, then return the item
- The item is passed to the pipeline, and if it hits some I/O, it yields back to the spider to crawl the next page, like normal asynchronous I/O does.
Or do I have to wait until the files are downloaded before I can crawl the next page? Is this a downside of Scrapy? The second part I don't follow is how to calculate the crawling speed. For instance, with 3s for a complete page and a default concurrency of 16, I guess @neverlastn used 16/2/3 to get 2.5 pages/s. Doesn't a concurrency of 16 mean I can handle 16 requests at the same time? So shouldn't the speed be 16 pages/s? Please correct me if I'm wrong.
Edit2
Thanks for your answer; I understand how to do the calculation now, but I still don't understand the second part. I first met this problem with 302: Error 302 Downloading File in Scrapy. I have a URL like
http://example.com/first
which responds with a 302 and redirects to
http://example.com/second
but Scrapy doesn't automatically redirect to the second one and cannot download the file, which is weird. The code here, Scrapy-redirect, and the docs here, RedirectMiddleware, point out that Scrapy should handle redirects by default. That is why I did some tricks trying to fix it. My third solution will try to use Celery like this:
class MyFilesPipeline(FilesPipeline):
    @app.task
    def handle_redirect(self, file_url):
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect.delay(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item
Since I already have a lot of spiders, I don't want to override them all with the second solution, so I handle this in the pipeline. Would this solution be better?
You use the requests API, which is synchronous/blocking. This means that you turn your concurrency (CONCURRENT_REQUESTS_PER_DOMAIN) from 8 (the default) to effectively one, and that seems to dominate your delay.
The trick you did on your second attempt is nice. It doesn't use requests, so it should be faster than the approach that uses requests (isn't it?). Now, of course, you add extra delay... If your first (HTML) request takes 1s and the second (image) request 2s, overall you have 3s for a complete page. With a default concurrency of 16, this would mean that you would crawl about 2.5 pages/s. When your redirect fails and you don't crawl the image, the process takes approx. 1s, i.e. 8 pages/s, so you might see a 3x slowdown. One solution might be to 3x the number of concurrent requests you allow to run in parallel by increasing CONCURRENT_REQUESTS_PER_DOMAIN and/or CONCURRENT_REQUESTS.
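For example, a settings sketch (the exact numbers are just an illustration, roughly 3x the defaults of 16 and 8; pick values the target site can reasonably handle):
# settings.py
CONCURRENT_REQUESTS = 48
CONCURRENT_REQUESTS_PER_DOMAIN = 24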
If you are now running this from a place with limited bandwidth and/or increased latency, another solution might be to run it from a cloud server closer to the area where the image servers are hosted (e.g. EC2 US East).
EDIT
The performance is better understood by "Little's law". First, both CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS typically work in parallel. CONCURRENT_REQUESTS_PER_DOMAIN = 8 by default, and I would guess that you typically download from a single domain, so your actual concurrency limit is 8. The level of concurrency (i.e. 8) isn't per second; it's a fixed number, like saying "that oven can bake at most 8 croissants at a time". How quickly your croissants bake is the latency (the web response time), and the metric you're interested in is their ratio, i.e. 8 croissants baked in parallel / 3 seconds per croissant = about 2.5 croissants/second.
On 302, I'm not sure what exactly you're trying to do. I think you're just following them - it's just that you do it manually. I think that Scrapy will do this for you when you extend the allowed codes. FilesPipeline might not get the value from handle_httpstatus_list, but the global setting HTTPERROR_ALLOWED_CODES should affect the FilesPipeline as well.
Anyway, requests is a bad option because it blocks, which means definitely very bad performance. yield-ing Scrapy Requests will "get them out of the way" (for now), but you will "meet them" again because they use the same resources, the scheduler and the downloader, to do the actual downloads. This means that they will most likely slow down your performance... and this is a good thing. I understand that your need here is to crawl fast, but Scrapy wants you to be conscious of what you're doing: when you set a concurrency limit of e.g. 8 or 16, you trust Scrapy not to "flood" your target sites at higher than that rate. Scrapy takes the pessimistic assumption that your media files, served by the same server/domain, are traffic to their web server (rather than some CDN) and applies the same limits in order to protect the target site and you. Otherwise, imagine a page that happens to have 1000 images on it. If you got those 1000 downloads somehow "for free", you would be doing 8000 requests to the server in parallel, with concurrency set to 8 - not a good thing.
If you want to get some downloads "for free", i.e. ones that don't adhere to the concurrency limits, you can use treq. This is the requests package for the Twisted framework. Here's how to use it in a pipeline. I would feel more comfortable using it for hitting APIs or web servers I own, rather than 3rd-party servers.
WARNING: there is a much better, hack-free solution
Add this to settings:
MEDIA_ALLOW_REDIRECTS = True
https://doc.scrapy.org/en/latest/topics/media-pipeline.html#allowing-redirections
Take note that in item_completed the results will contain the old, non-redirected URLs. file_path also gets the non-redirected request, so the file name will be calculated from the non-redirected data. If you want to add redirection info, you should probably implement your own media_to_download method in the file pipeline and include response.meta in the results, as it should contain the redirection info: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
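For example, a minimal sketch of surfacing the redirect information in the results; note this overrides media_downloaded (the FilesPipeline hook that actually receives the final response) rather than media_to_download, and the class name and the redirect_urls key placement are choices of this sketch:
from scrapy.pipelines.files import FilesPipeline

class RedirectAwareFilesPipeline(FilesPipeline):
    # Sketch only: copy redirect information from response.meta into results.

    def media_downloaded(self, response, request, info):
        result = super(RedirectAwareFilesPipeline, self).media_downloaded(
            response, request, info)
        # RedirectMiddleware records the chain of original URLs in response.meta
        result["redirect_urls"] = response.meta.get("redirect_urls", [])
        return result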