Parsing HTML in Python - lxml or BeautifulSoup? Wh

Posted 2019-01-03 05:10

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is faster.

So I'm wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?

7 answers
不美不萌又怎样
Reply #2 · 2019-01-03 05:39

For sure, I would use EHP. It is faster than lxml, much more elegant, and simpler to use.

Check it out: https://github.com/iogf/ehp

<body ><em > foo  <font color="red" ></font></em></body>


from ehp import *

data = '''<html> <body> <em> Hello world. </em> </body> </html>'''

html = Html()
dom = html.feed(data)

for ind in dom.find('em'):
    print(ind.text())

Output:

Hello world. 
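
For comparison, here is a rough sketch of the same extraction using only the standard library's html.parser module, so there is a dependency-free baseline to weigh EHP, lxml and BeautifulSoup against. It is not part of the EHP API; the class name EmTextCollector and its attributes are made up for illustration.

```python
from html.parser import HTMLParser


class EmTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <em> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <em> tags currently open
        self.texts = []   # one string per outermost <em> element

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            if self.depth == 0:
                # Start a new entry when an outermost <em> opens.
                self.texts.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "em" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while an <em> is open.
        if self.depth > 0:
            self.texts[-1] += data


data = '''<html> <body> <em> Hello world. </em> </body> </html>'''

parser = EmTextCollector()
parser.feed(data)
parser.close()

for text in parser.texts:
    print(text)
```

This is noticeably more verbose than the EHP snippet, because html.parser is event-based and does not build a tree you can query. That is the main thing a library such as EHP, lxml or BeautifulSoup saves you from writing.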