I'm writing a spider in Python, using the lxml library for parsing HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all the server's memory). But I have only 100 async greenlets, and each of them parses documents of at most 300 KB.
I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it.
The problem in this line of code:
HTML = lxml.html.fromstring(htmltext)
Maybe someone knows what this can be, or how to fix it?
Thanks for the help.
P.S.
Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
UPDATE:
I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.
Now, after some time, they start writing to the error log:
"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".
I can't catch this exception, so it writes this message to the log recursively until there is no free space left on disk.
How can I catch this exception and kill the process, so the daemon can create a new one?
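One way to sidestep this: the MemoryError is raised inside a C-level callback, so a Python try/except around fromstring rarely sees it. A minimal sketch (my own workaround, not something from the lxml docs — the threshold and function names are assumptions) that instead checks the process's own memory footprint between jobs and exits before the ulimit is hit, so the daemon can respawn the worker:

```python
import os
import resource

# Assumed threshold, chosen to sit safely below the ulimit -Sm cap.
RSS_LIMIT_KB = 400_000


def memory_guard(limit_kb=RSS_LIMIT_KB, _exit=os._exit):
    # ru_maxrss is reported in kilobytes on Linux.
    used_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if used_kb > limit_kb:
        # Hard exit: skip cleanup code that might itself need to
        # allocate memory; the supervising daemon spawns a fresh worker.
        _exit(1)
```

Calling memory_guard() in each greenlet between parse jobs turns the uncatchable in-parser MemoryError into a controlled restart.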
It seems the issue stems from the library lxml relies on: libxml2, which is written in C. Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html This bug is mentioned neither in the lxml v2.3 bug-fix log nor in the libxml2 change logs.
Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924
Well, I tried to reproduce the issue but got nothing abnormal. Anyone who can reproduce it may be able to help clarify the problem.
There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; it might help you figure out what's not being released and why.
Edit: I found the article from which I got that link - Python memory leaks
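In the same spirit, a crude, dependency-free probe (the function name is my own; this assumes you just want to count suspect objects between crawl batches) can be built on the gc module:

```python
import gc


def count_live(typename):
    # Count live objects whose type name matches, e.g. "HtmlElement".
    # Comparing counts before and after a crawl batch shows whether
    # parsed documents are actually being released.
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == typename)
```

Note that gc.get_objects() only sees objects tracked by the Python garbage collector, so memory held purely on the C side by libxml2 will not show up here.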
You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element and thus keep the whole tree in memory if you keep a reference to them. See the docs on XPath return values:
(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
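The smart-string behaviour can be seen directly; this sketch uses lxml's documented smart_strings option to get plain strings back instead:

```python
from lxml import etree

doc = etree.fromstring("<root><a>hello</a></root>")

# Default: xpath() returns "smart" strings that keep a reference to
# their parent element -- and therefore to the whole parsed tree.
smart = doc.xpath("//a/text()")[0]
assert smart.getparent().tag == "a"

# Compiling the expression with smart_strings=False yields plain
# strings, so the tree can be freed once `doc` goes out of scope.
find_text = etree.XPath("//a/text()", smart_strings=False)
plain = find_text(doc)[0]
assert not hasattr(plain, "getparent")
```

If the spider stores scraped text long-term, converting results with str() (or using smart_strings=False as above) breaks the hidden link back to the document.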