HTML parsing with Beautiful Soup returns empty list

Posted 2019-02-19 18:19

I have no idea why this piece of code does not work with this particular site. In other cases it works fine.

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.i-apteka.pl/search.php?node=443&counter=all"
    content = requests.get(url).text
    soup = BeautifulSoup(content)  # no parser specified, so BS picks one itself

    links = soup.find_all("a", class_="n63009_prod_link")
    print links

In this case it prints "[]", but there are obviously some links on the page. Any idea? :)

2 Answers
对你真心纯属浪费
#2 · 2019-02-19 18:48

I had the same problem: locally Beautiful Soup was working, but on my Ubuntu server it was returning an empty list all the time. I tried many parsers following the link [1] and tried many dependencies.

Finally what worked for me was:

  • remove the Beautiful Soup installation
  • remove all of its dependencies (the ones pulled in by apt-get install python-bs4)
  • reinstall it using the commands below

commands:

sudo apt-get install python-bs4

pip install beautifulsoup4

and I'm using the following code:

soup = BeautifulSoup(my_html_content, 'html.parser')
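To confirm the reinstall took effect, a quick smoke test along these lines can help (a sketch of my own, not part of the steps above):

    import bs4
    from bs4 import BeautifulSoup

    print(bs4.__version__)  # should show the version pip just installed

    # Parse a trivial document with the stdlib parser to confirm it works
    print(BeautifulSoup("<a class='x'>ok</a>", "html.parser").a)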

[1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

不美不萌又怎样
#3 · 2019-02-19 18:53

You've found a bug in whichever parser you're using.

I don't know which parser you're using but I do know this:

Python 2.7.2 (from Apple), BS 4.1.3 (from pip), libxml2 2.9.0 (from Homebrew), lxml 3.1.0 (from pip) gets the exact same error as you. Everything else I try—including the same things as above except libxml2 2.7.8 (from Apple)—works. And lxml is the default (at least as of 4.1.3) that BS will try first if you don't specify anything else. And I've seen other unexpected bugs with libxml2 2.9.0 (most of which have been fixed on trunk, but no 2.9.1 has been released yet).

So, if this is your problem, you may want to downgrade to 2.8.0 and/or build it from top of tree.
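If you want to check whether you are on that problematic combination before downgrading, lxml exposes the libxml2 versions it was compiled against and is running with (a diagnostic sketch, not from the original answer):

    from lxml import etree

    print(etree.LXML_VERSION)             # lxml version, e.g. (3, 1, 0, 0)
    print(etree.LIBXML_VERSION)           # libxml2 version lxml is running against
    print(etree.LIBXML_COMPILED_VERSION)  # libxml2 version lxml was compiled against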

But if not… it definitely works for me with 2.7.2 with the stdlib html.parser, and in chat you tested the same thing with 2.7.1. While html.parser (especially before 2.7.3) is slow and brittle, it seems to be good enough for you. So, the simplest solution is to do this:

soup = BeautifulSoup(content, 'html.parser')

… instead of just letting it pick its favorite parser.
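Applied to the code from the question, that looks something like this (a sketch; the URL and class name are taken from the question, and only the parser argument is new):

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.i-apteka.pl/search.php?node=443&counter=all"
    content = requests.get(url).text

    # Explicitly ask for the stdlib parser instead of letting BS pick lxml
    soup = BeautifulSoup(content, "html.parser")

    links = soup.find_all("a", class_="n63009_prod_link")
    print(links)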

For more info, see Specifying the parser to use (and the sections right above and below).
