I'm working on large projects that require fast HTML parsing, including recovery for broken HTML pages.
Currently lxml is my choice; I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces noticeably better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/; this page has a broken <header> tag which lxml/libxml2 couldn't correct). However, the problem is that BS is extremely slow.
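To illustrate, a minimal sketch of the kind of call I mean (the file name is just a placeholder); lxml's HTMLParser already runs libxml2 in recovery mode by default:
from lxml import etree

# libxml2 recovery mode is on by default for HTMLParser (recover=True)
parser = etree.HTMLParser(recover=True)
tree = etree.parse("broken_page.html", parser)
print(etree.tostring(tree, method="html").decode())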
From what I can see, modern browsers like Chrome and Firefox parse HTML very quickly and handle broken HTML very well. Like lxml, Chrome builds on libxml2 and libxslt, but with a more effective algorithm for handling broken HTML. I was hoping there would be standalone libraries extracted from Chromium that I could use, but I haven't found anything like that yet.
Does anyone know a good library, or at least a workaround (utilizing parts of existing parsers)? Thanks a lot!
BeautifulSoup does a really good job at making the broken HTML soup beautiful. You can make the parsing faster by letting it use lxml.html under the hood. From the documentation:
If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
The other optimization might be the SoupStrainer: parsing only a desired part of the HTML document. I'm not sure whether it's applicable in your use case.
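If it does apply, here's a minimal sketch, assuming for the sake of example that you only need the links:
from bs4 import BeautifulSoup, SoupStrainer

# only <a> tags are parsed into the tree; the rest of the document is skipped
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "lxml", parse_only=only_links)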
You can also speed things up by installing the cchardet library:
You can speed up encoding detection significantly by installing the cchardet library.
Documentation reference.
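Nothing in the parsing code changes for this one; as far as I know, Beautiful Soup picks cchardet up automatically once it is importable:
pip install cchardet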
From what I can see, modern browsers like Chrome and Firefox parse HTML very quickly and handle broken HTML very well.
I understand that this adds huge overhead, but just to add something to your options: you can fire up Chrome via selenium, navigate to the desired address (or open a local HTML file) and dump the repaired HTML back out of .page_source:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("url")  # the target address, or a local file via a file:// URL
# a delay or an explicit wait may be needed here for dynamic content
html = driver.page_source
print(html)
driver.close()
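Since the browser has already repaired the markup at that point, you could hand the dumped source back to lxml and keep its speed for the actual querying; a rough sketch continuing from the snippet above:
import lxml.html

# the browser-repaired markup should parse cleanly now
tree = lxml.html.fromstring(html)
headers = tree.findall(".//header")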
Also see:
- Browser parsers vs Stand-alone parsers