I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find all of them. I'm using Python 3.
The code is this:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])
It only prints half of them...
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser is not dealing very well with it:
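For example, counting the matches when parsing with lxml (a rough sketch; lxml must be installed, and the exact count depends on what the page serves when you fetch it):

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://mangafox.me/directory/").read()

# On this page lxml recovers less of the broken markup, so fewer of the
# 'manga_img' anchors end up in the parsed tree and the search comes up short.
soup = BeautifulSoup(html, 'lxml')
print(len(soup.find_all('a', {'class': 'manga_img'})))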
The standard library html.parser has less trouble with this specific page:
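The same count with the bundled parser, for comparison (again a sketch, run both and compare the numbers):

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://mangafox.me/directory/").read()

# html.parser tolerates this page's markup better, so more of the
# 'manga_img' anchors make it into the tree.
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('a', {'class': 'manga_img'})))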
Translating that to your specific code sample using urllib, you would specify the parser thus:
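Keeping the rest of your script as it is, the parsing part would become something like:

# Name the parser explicitly instead of letting BeautifulSoup choose one.
soup = BeautifulSoup(page.read(), 'html.parser')
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])

When you don't name a parser, BeautifulSoup picks the "best" one installed (lxml, if present), which is why you can get different results without changing a line of your own code.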