From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml, and I've heard that lxml is faster.
So I'm wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?
In summary, `lxml` is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a `soupparser` module to fall back on BeautifulSoup's functionality. `BeautifulSoup` is a one-person project, designed to save you time quickly extracting data out of poorly-formed HTML or XML.

The lxml documentation says that both parsers have advantages and disadvantages. For this reason, `lxml` provides a `soupparser` so you can switch back and forth: BeautifulSoup is more forgiving of some kinds of broken markup and has better encoding detection, while lxml's own parser often parses and fixes broken HTML at least as well. In the end, they note that the soup parser is much slower, so if performance matters you might want to use it only as a fallback for the cases lxml's parser cannot handle.
If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas `lxml` is more straightforward and just parses things and builds a tree as you would expect. I assume this also applies to `BeautifulSoup` itself, not just to the `soupparser` for `lxml`.
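To make the switching concrete, here is a minimal sketch (the malformed snippet is invented); both entry points return the same kind of lxml tree, so downstream code doesn't care which parser built it:

```python
import lxml.html
import lxml.html.soupparser

tag_soup = "<b><p>Hello</b> world"  # invented malformed input

# Fast path: lxml's own forgiving HTML parser
root = lxml.html.fromstring(tag_soup)

# Fallback path: BeautifulSoup untangles the soup, lxml builds the tree
# (requires BeautifulSoup to be installed)
root = lxml.html.soupparser.fromstring(tag_soup)

# Either way you get an lxml element with the usual API
print(lxml.html.tostring(root).decode())
```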
They also show how to benefit from `BeautifulSoup`'s encoding detection while still parsing quickly with `lxml`.
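A sketch along the lines of the example on that page, updated to the `bs4` package name (the sample bytes here are invented):

```python
import lxml.html
from bs4 import UnicodeDammit  # BeautifulSoup 4's encoding detector

def decode_html(html_bytes):
    # UnicodeDammit sniffs the encoding (BOM, <meta> tags, chardet if available)
    converted = UnicodeDammit(html_bytes)
    if converted.unicode_markup is None:
        raise ValueError("Failed to detect encoding")
    return converted.unicode_markup

tag_soup = b"<p>caf\xe9</p>"  # invented bytes in an undeclared legacy encoding
root = lxml.html.fromstring(decode_html(tag_soup))
```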
:(Same source: http://lxml.de/elementsoup.html).
In the words of `BeautifulSoup`'s creator, it exists to save everybody time getting data out of poorly-designed websites (paraphrasing the Beautiful Soup documentation).
I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, the lxml website and its "Why lxml?" page emphasize the opposite trade-off: lxml is widely used in production, and it builds on the C libraries libxml2 and libxslt, which is where its speed and feature-completeness come from.
A somewhat outdated speed comparison can be found here, which clearly recommends lxml, as the speed differences seem drastic.
I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.
Here's a quick test I had lying around for trying the handling of some ugly HTML.
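A sketch along those lines, with made-up markup standing in for the ugly input:

```python
import lxml.html

ugly = """<html><body>
<p>Unclosed paragraph
<p>Bad <b>nesting <i>of tags</b> here
<a href="/page?a=1&b=2">bare & ampersand</a>
"""

root = lxml.html.fromstring(ugly)

# lxml closes the open tags and repairs the nesting...
print(lxml.html.tostring(root, pretty_print=True).decode())

# ...and the repaired tree is queryable as usual
print([a.get("href") for a in root.iter("a")])
```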
For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.
`pyquery` provides the jQuery selector interface to Python (using lxml under the hood).

http://pypi.python.org/pypi/pyquery
It's really awesome, I don't use anything else anymore.
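For a taste of the interface (a made-up snippet):

```python
from pyquery import PyQuery as pq

d = pq("<div><p class='hello'>Hi</p><p>Bye</p></div>")
print(d("p.hello").text())  # -> Hi
print(d("p").eq(1).text())  # -> Bye
```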
Don't use BeautifulSoup on its own; use `lxml.soupparser`. That way you're sitting on top of the power of lxml, and can still use the good bit of BeautifulSoup, which is dealing with really broken and crappy HTML.
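A minimal sketch of that combination (markup invented for illustration):

```python
from lxml.html import soupparser

# BeautifulSoup copes with the really broken markup...
root = soupparser.fromstring("<b><p>really broken</b> html")

# ...while lxml gives you its tree API on the result, XPath included
print(root.xpath("//p//text()"))
```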