What is the deal with HTTPS when using lxml?

Posted 2019-06-15 07:51

I am using lxml to parse HTML pages given their URLs.

For example:

import lxml.html

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)

My code works well in most cases, namely the ones with http:// URLs. However, for every https:// URL, lxml simply raises an IOError. Does anyone know the reason, and possibly how to correct it?

BTW, I want to stick with lxml rather than switch to BeautifulSoup, given that I already have a quickly finished program.

2 answers
Juvenile、少年°
#2 · 2019-06-15 08:20

I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:

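from lxml import html
from urllib2 import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen

# urlopen handles the HTTPS connection; lxml only reads the file-like response.
tree = html.parse(urlopen('https://duckduckgo.com'))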
啃猪蹄的小仙女
#3 · 2019-06-15 08:20

From the lxml documentation:

lxml can parse from a local file, an HTTP URL or an FTP URL

I don't see HTTPS in that sentence anywhere, so I assume it is not supported.

An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
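
A minimal sketch of that workaround, assuming Python 3, where `urllib.request.urlopen` takes the place of `urllib2.urlopen`. The helper names `parse_html_bytes` and `fetch_and_parse` and the 10-second timeout are made up for illustration; they are not part of lxml or of the question's code.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

import lxml.html


def parse_html_bytes(data):
    """Parse an already-downloaded HTML document (bytes or str) with lxml."""
    # fromstring works on in-memory data, so lxml never opens the URL itself.
    return lxml.html.fromstring(data)


def fetch_and_parse(url, timeout=10):
    """Download url (HTTPS included) and return the parsed root element."""
    # urlopen performs the HTTPS request; the timeout value is an assumption.
    with urlopen(url, timeout=timeout) as response:
        return parse_html_bytes(response.read())
```

Something like `root = fetch_and_parse('https://duckduckgo.com')` followed by `root.findtext('.//title')` should then behave the same for http:// and https:// URLs, since the download no longer goes through lxml.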
