HTML parsing with Beautiful Soup returns empty list

Posted 2019-02-19 18:19

I have no idea why this piece of code does not work with this particular site. In other cases it works fine.

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.i-apteka.pl/search.php?node=443&counter=all"
    content = requests.get(url).text
    soup = BeautifulSoup(content)  # no parser specified, so BS picks one itself

    links = soup.find_all("a", class_="n63009_prod_link")
    print links

In this case it prints "[]", but there are obviously some links on the page. Any idea? :)

2 Answers
对你真心纯属浪费
#2 · 2019-02-19 18:48

I had the same problem: locally Beautiful Soup was working, but on my Ubuntu server it was returning an empty list all the time. I tried many parsers following the link [1] and tried many dependencies.

Finally what worked for me was:

  • remove the Beautiful Soup installation
  • remove all of its dependencies (the ones pulled in by apt-get install python-bs4)
  • reinstall it using the commands below

commands:

sudo apt-get install python-bs4

pip install beautifulsoup4

and I'm using the following code:

soup = BeautifulSoup(my_html_content, 'html.parser')
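To confirm the reinstall took effect, a quick smoke test along these lines can help (a sketch of my own, not part of the steps above):

    import bs4
    from bs4 import BeautifulSoup

    print(bs4.__version__)  # should show the version pip just installed

    # Parse a trivial document with the stdlib parser to confirm it works
    print(BeautifulSoup("<a class='x'>ok</a>", "html.parser").a)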

[1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

不美不萌又怎样
#3 · 2019-02-19 18:53

You've found a bug in whichever parser you're using.

I don't know which parser you're using but I do know this:

Python 2.7.2 (from Apple), BS 4.1.3 (from pip), libxml2 2.9.0 (from Homebrew), lxml 3.1.0 (from pip) gets the exact same error as you. Everything else I try—including the same things as above except libxml2 2.7.8 (from Apple)—works. And lxml is the default (at least as of 4.1.3) that BS will try first if you don't specify anything else. And I've seen other unexpected bugs with libxml2 2.9.0 (most of which have been fixed on trunk, but no 2.9.1 has been released yet).

So, if this is your problem, you may want to downgrade to 2.8.0 and/or build it from top of tree.
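If you want to check whether you are on that problematic combination before downgrading, lxml exposes the libxml2 versions it was compiled against and is running with (a diagnostic sketch, not from the original answer):

    from lxml import etree

    print(etree.LXML_VERSION)             # lxml version, e.g. (3, 1, 0, 0)
    print(etree.LIBXML_VERSION)           # libxml2 version lxml is running against
    print(etree.LIBXML_COMPILED_VERSION)  # libxml2 version lxml was compiled against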

But if not… it definitely works for me with 2.7.2 with the stdlib html.parser, and in chat you tested the same thing with 2.7.1. While html.parser (especially before 2.7.3) is slow and brittle, it seems to be good enough for you. So, the simplest solution is to do this:

soup = BeautifulSoup(content, 'html.parser')

… instead of just letting it pick its favorite parser.
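Applied to the code from the question, that looks something like this (a sketch; the URL and class name are taken from the question, and only the parser argument is new):

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.i-apteka.pl/search.php?node=443&counter=all"
    content = requests.get(url).text

    # Explicitly ask for the stdlib parser instead of letting BS pick lxml
    soup = BeautifulSoup(content, "html.parser")

    links = soup.find_all("a", class_="n63009_prod_link")
    print(links)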

For more info, see Specifying the parser to use (and the sections right above and below).
