Scrapy blocks on I/O when downloading files

Posted 2019-05-13 21:51

I'm using Scrapy to scrape a website and download some files. Since the file_url I get redirects to another URL (a 302 redirect), I use a separate method, handle_redirect, to resolve the redirected URL. I customized the FilesPipeline like this:

import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def handle_redirect(self, file_url):
        # blocking HEAD request (python-requests) to resolve the 302 target
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item
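
(The pipeline is enabled in settings roughly like this; the module path and storage directory below are placeholders:)

# settings.py (sketch; paths are placeholders)
ITEM_PIPELINES = {
    "myproject.pipelines.MyFilesPipeline": 1,
}
FILES_STORE = "/path/to/downloaded/files"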

With the code above I can download the files, but the process blocks while downloading, so the whole project becomes very slow.

I tried another solution in the spider: get the redirected URL first with a request, then pass it on to another callback and let the default FilesPipeline do the download.

    def parse(self, response):
        # ... "download_url" and "name" are extracted from the page here ...
        yield scrapy.Request(
            download_url[0],
            meta={"name": name},
            dont_filter=True,
            callback=self.handle_redirect)

    def handle_redirect(self, response):
        logging.warning("response meta: %s" % response.meta)
        # handle_httpstatus_list = [302] lets the 302 reach this callback,
        # so the target URL is read from the Location header manually
        download_url = response.headers["Location"].decode("utf-8")

        return AppListItem(
            name=response.meta["name"],
            file_urls=[download_url],
        )

This still blocks the process.

From the docs here:

Using the Files Pipeline

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finished downloading (or fail for some reason).

Does this mean I can't crawl the next URL until the files have been downloaded? (I don't set download_delay in my settings.)

EDIT

I had already added this at the start:

handle_httpstatus_list = [302]

so I would not be redirected to the target URL automatically. My first solution used requests because I thought yield would work like this:

  1. I crawl a page, keep yielding callbacks, and then return the item.
  2. The item is passed to the pipeline, and if it hits some I/O, it yields back to the spider to crawl the next page, like normal asynchronous I/O does.

Or do I have to wait until the files are downloaded before I can crawl the next page? Is this a downside of Scrapy? The second part I don't follow is how to calculate the crawl speed. For instance, with 3s for a complete page and a default concurrency of 16, I guess @neverlastn used 16/2/3 to get 2.5 pages/s. Doesn't a concurrency of 16 mean I can handle 16 requests at the same time, so the speed should be 16 pages/s? Please correct me if I'm wrong.

EDIT 2

Thanks for your answer, I understand how to calculate it now, but I still don't understand the second part. I first met this problem on 302: Error 302 Downloading File in Scrapy. I have a URL like

http://example.com/first

which will use 302 and redirect to

http://example.com/second

but Scrapy doesn't automatically redirect to the second one and cannot download the file, which is weird. From the code here (Scrapy-redirect) and the docs here (RedirectMiddleware), Scrapy should handle redirects by default. That is why I did some tricks to try to fix it. My third solution will try to use Celery, like this:

import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

# "app" is assumed to be the Celery application instance, defined elsewhere


class MyFilesPipeline(FilesPipeline):

    @app.task
    def handle_redirect(self, file_url):
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        # note: .delay() returns an AsyncResult, not the resolved URL itself;
        # getting the value back requires .get(), which blocks again
        redirect_url = self.handle_redirect.delay(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item

Since I already have a lot of spiders, I don't want to rewrite them to use the second solution, so I handle the redirect in the pipeline instead. Would this solution be better?

Tags: python scrapy
2 Answers
Answer from 做自己的国王 · 2019-05-13 22:05

You use the requests API, which is synchronous/blocking. This means that you turn your concurrency (CONCURRENT_REQUESTS_PER_DOMAIN) from 8 (the default) into effectively one. It seems like that is what dominates your delay.

Nice trick, the one you did on your second attempt. It doesn't use requests, so it should be faster than the requests-based version. Now, of course, you add extra delay... If your first (HTML) request takes 1s and the second (image) request 2s, overall you have 3s for a complete page. With a default concurrency of 16, this would mean you would crawl about 2.5 pages/s. When your redirect fails and you don't crawl the image, the process would take approximately 1s, i.e. 8 pages/s, so you might see a 3x slowdown.

One solution might be to 3x the number of concurrent requests you allow to run in parallel by increasing CONCURRENT_REQUESTS_PER_DOMAIN and/or CONCURRENT_REQUESTS. If you are running this from a place with limited bandwidth and/or increased latency, another solution might be to run it from a cloud server closer to the area where the image servers are hosted (e.g. EC2 US East).
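
For example, a rough settings sketch of that change (the values are only illustrative, not a recommendation):

# settings.py - illustrative values only
CONCURRENT_REQUESTS = 48             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 24  # default is 8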

EDIT

The performance is better understood through Little's law. First, both CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS typically apply in parallel. CONCURRENT_REQUESTS_PER_DOMAIN is 8 by default, and I would guess that you typically download from a single domain, so your actual concurrency limit is 8. The level of concurrency (i.e. 8) isn't per second; it's a fixed number, like saying "that oven can bake at most 8 croissants at a time". How quickly your croissants bake is the latency (the web response time), and the metric you're interested in is their ratio, i.e. 8 croissants baking in parallel / 3 seconds per croissant = about 2.5 croissants/second.
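
In code, the back-of-the-envelope version of that ratio looks like this (numbers taken from the example above):

# Little's law sketch with the croissant numbers from above
concurrency = 8            # CONCURRENT_REQUESTS_PER_DOMAIN: requests in flight at once
seconds_per_request = 3.0  # latency: 1s for the HTML page + 2s for the file
throughput = concurrency / seconds_per_request
print(throughput)          # ~2.7 pages/s, roughly the 2.5 pages/s quoted above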


On the 302s, I'm not sure what exactly you're trying to do. I think you're just following them; it's just that you do it manually. I think Scrapy will do this for you once the allowed codes are extended. FilesPipeline might not pick up the value from handle_httpstatus_list, but the global setting HTTPERROR_ALLOWED_CODES should affect FilesPipeline as well.

Anyway, requests is a bad option because it blocks, which means definitely very bad performance. Yielding Scrapy Requests will "get them out of the way" (for now), but you will "meet them" again, because they use the same resources, the scheduler and the downloader, to do the actual downloads. This means that they will most likely slow down your performance... and this is a good thing. I understand that your need here is to crawl fast, but Scrapy wants you to be conscious of what you're doing: when you set a concurrency limit of e.g. 8 or 16, you trust Scrapy not to "flood" your target sites at a higher rate than that. Scrapy takes the pessimistic assumption that your media files, served by the same server/domain, are traffic to that web server (instead of some CDN) and applies the same limits in order to protect the target site and you. Otherwise, imagine a page that happens to have 1000 images in it. If you got those 1000 downloads somehow "for free", you would be doing 8000 requests to the server in parallel, with concurrency set to 8 - not a good thing.

If you want to get some downloads "for free", i.e. ones that don't adhere to the concurrency limits, you can use treq. This is the requests package for the Twisted framework. Here's how to use it in a pipeline. I would feel more comfortable using it to hit APIs or web servers I own, rather than 3rd-party servers.
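
A minimal sketch of that pattern (the endpoint, class name, and use case are made up for illustration; it only shows returning a Deferred from process_item):

import treq
from twisted.internet import defer


class NotifyApiPipeline(object):
    """Fires a request via treq, outside Scrapy's scheduler/downloader."""

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # hypothetical endpoint; this request bypasses the downloader,
        # so it is not subject to the CONCURRENT_REQUESTS limits
        response = yield treq.get("http://api.example.com/ping")
        yield treq.content(response)  # read the body so the connection is released
        defer.returnValue(item)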

Answer from 太酷不给撩 · 2019-05-13 22:14

WARNING: there is a much better, hack-free solution.

Add this to settings: MEDIA_ALLOW_REDIRECTS = True

https://doc.scrapy.org/en/latest/topics/media-pipeline.html#allowing-redirections

Note that in item_completed the results will contain the old, non-redirected URLs. file_path also receives the non-redirected request, so the file name will be calculated from the non-redirected data. If you want to add redirection info, you should probably implement your own media_to_download method in the files pipeline and include response.meta in the results, as it should contain the redirection info:

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
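
As a rough sketch of surfacing the redirect target (this uses the related media_downloaded hook, which builds the per-file result dict, rather than a full media_to_download override; it assumes a Scrapy version where media_downloaded takes (response, request, info), and the class and field names are illustrative):

from scrapy.pipelines.files import FilesPipeline


class RedirectAwareFilesPipeline(FilesPipeline):

    def media_downloaded(self, response, request, info):
        result = super(RedirectAwareFilesPipeline, self).media_downloaded(
            response, request, info)
        # request.url is the original (pre-redirect) URL the file name is
        # derived from; response.url is where the file actually came from
        # after the 302 was followed (with MEDIA_ALLOW_REDIRECTS = True)
        result["redirected_url"] = response.url
        return result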
