I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon a very bizarre thing. The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, like multiple <html></html> tags, a <title> outside the <head>, etc.
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
When it comes to parsing tricky, not-well-formed HTML, the choice of parser is very important. Quoting the Beautiful Soup documentation:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
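The effect is easy to see on a small sample. This sketch feeds the same deliberately malformed fragment (an unclosed <a> tag, a hypothetical snippet, not the actual page) to each parser and counts the links each one recovers; lxml and html5lib are skipped if they are not installed:

```python
from bs4 import BeautifulSoup

# Deliberately malformed fragment: the <a> tag is never closed
broken = "<html><title>Oops</title><body><a href='/x'>one link<p>more text"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
        results[parser] = len(soup.find_all("a"))
    except Exception:
        results[parser] = None  # parser library not installed
    print(parser, "->", results[parser])
```

Each parser repairs the broken markup differently, so the resulting trees (and what find_all can see in them) differ.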
html.parser
worked for me:
from bs4 import BeautifulSoup
import requests
document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
See also:
- Differences between parsers.
Even though the correct answer is "use another parser" (thanks @alecxe), I have another workaround. For some reason, this also works:
soup = BeautifulSoup(document, "html5lib")
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))
which returns the same link list as:
soup = BeautifulSoup(document, "html.parser")
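A plausible explanation for why the round trip helps: the first parse repairs the broken markup into a tree, and prettify() serializes that tree back out as well-formed HTML, so the second parse receives clean input. A minimal sketch of the idea on a hypothetical malformed snippet (using html.parser so it runs without html5lib installed):

```python
from bs4 import BeautifulSoup

# Hypothetical malformed sample; the real document is fetched from the page
broken = "<html><title>t</title><body><a href='/x'>link"

# First parse builds a repaired tree; prettify() emits it as
# well-formed HTML text, which any parser can then handle cleanly.
soup = BeautifulSoup(broken, "html.parser")
reparsed = BeautifulSoup(soup.prettify(), "html.parser")
print(len(reparsed.find_all("a")))  # 1
```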