Memory Leak in Scrapy

Posted 2020-07-14 16:11

Question:

I wrote the following code to scrape email addresses (for testing purposes):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule      # scrapy.spiders in newer Scrapy versions
from scrapy.contrib.linkextractors import LinkExtractor   # scrapy.linkextractors in newer Scrapy versions
from scrapy.selector import Selector
from crawler.items import EmailItem

class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items

Unfortunately, it seems that references to the Requests are not released properly: in the Scrapy telnet console, the number of live Requests increases by about 5k per second. After ~3 minutes and 10k scraped pages, my system starts swapping (8 GB RAM). Does anyone have an idea what is wrong? I already tried removing the refer field and "copying" the string using

emailitem['email'] = ''.join(email)

without success. After scraping, the items get saved into a BerkeleyDB that counts their occurrences (via an item pipeline), so the references should be gone after that.
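For context, a counting pipeline of that kind might look roughly like the following. This is a minimal sketch only: the bsddb3 package, the EmailCountPipeline class name and the email_counts.db filename are assumptions for illustration, not taken from the original post; the pipeline would also need to be registered via ITEM_PIPELINES in settings.py.

import bsddb3

class EmailCountPipeline(object):
    """Hypothetical pipeline that counts occurrences of each scraped email address."""

    def open_spider(self, spider):
        # 'c' opens the hash database file, creating it if it does not exist
        self.db = bsddb3.hashopen('email_counts.db', 'c')

    def process_item(self, item, spider):
        key = item['email'].encode('utf-8')
        try:
            count = int(self.db[key])
        except KeyError:
            count = 0
        self.db[key] = str(count + 1).encode('utf-8')
        # return the item and keep no other reference to it
        return item

    def close_spider(self, spider):
        self.db.sync()
        self.db.close()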

What would be the difference between returning a list of items and yielding each item separately?

EDIT:

After quite a while of debugging, I found out that the Requests are not freed, so I end up with:

$> nc localhost 6023
>>> prefs()
Live References
Request 10344   oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>

which is in fact the start url. Anybody knows what the problem is? Where is the missing reference to the Request object?

EDIT2:

After running for ~12 hours on a server with 64 GB RAM, the process uses ~16 GB of RAM (according to ps, even though ps is not the ideal tool for this). The problem is that the number of crawled pages is dropping significantly, and the number of scraped items has been stuck at 0 for hours:

INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)

EDIT3: I did the objgraph analysis, which results in the following graph (thanks @Artur Gaspar):

It does not look like something I can influence, does it?
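For readers who want to reproduce this kind of analysis: a back-reference graph for the oldest live Request can be generated from the same telnet console used above, roughly as follows. This is a sketch; it assumes objgraph and Graphviz are installed on the crawling host, and the output filename is arbitrary.

>>> import objgraph
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> objgraph.show_backrefs(r, max_depth=5, filename='request_backrefs.png')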

Answer 1:

The final answer for me was to use a disk-based queue together with a working directory passed as a runtime parameter.

This means adding the following code to settings.py:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'    # 'scrapy.squeues.PickleFifoDiskQueue' in newer Scrapy versions
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'      # 'scrapy.squeues.FifoMemoryQueue' in newer Scrapy versions

Afterwards, starting the crawler with the following command line makes the scheduler state persistent in the given directory:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}

See the Scrapy docs on jobs (pausing and resuming crawls) for details.
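Alternatively, the same setup can be kept entirely in settings.py by setting JOBDIR there, so a plain scrapy crawl {spidername} persists and resumes the same job. A sketch, assuming a newer Scrapy version (hence the scrapy.squeues module path) and an arbitrary directory name:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
JOBDIR = 'crawls/emailextractor'   # scheduler queues and spider state are persisted here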

The additional benefit of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days while using ~15 GB of memory (mostly file cache for the disk FIFO queues).



Answer 2:

If you yield each item separately, the code is executed differently by the Python interpreter: the callback is no longer an ordinary function but a generator function.

This way, the full list is never created, and memory for each item is allocated one at a time, whenever the code consuming the generator asks for the next item.

So it could be that you don't have a memory leak at all; you simply have a lot of memory being allocated, roughly 10k pages times the memory used by the item list for one page.
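As a concrete illustration, the callback from the question could be rewritten as a generator roughly like this (a sketch based on the question's own code, to be dropped into the same spider; it is not guaranteed to fix the leak by itself):

    def process_item(self, response):
        refer = response.url
        # yielding makes this callback a generator: each EmailItem is handed
        # to the item pipeline as soon as it is produced, so no per-page list
        # of items is ever built up in memory
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            yield emailitem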

Of course, you could still have a real memory leak; there are tips for debugging memory leaks in Scrapy here.