What is the deal about https when using lxml?

2019-06-15 07:51发布

I am using lxml to parse html files given urls.

For example:

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)

My code is working well for most of the cases, the ones with http://. However, I found for every https:// url, lxml simply gets an IOError. Does anyone know the reason? And possibly, how to correct this problem?

BTW, I want to stick to lxml than switch to BeautifulSoup given I've already got a quick finished programme.

标签： python parsing lxml

2条回答

Juvenile、少年°

2楼-- · 2019-06-15 08:20

I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:

from lxml import html
from urllib2 import urlopen

html.parse(urlopen('https://duckduckgo.com'))

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

3楼-- · 2019-06-15 08:20

From the lxml documentation:

lxml can parse from a local file, an HTTP URL or an FTP URL

I don't see HTTPS in that sentence anywhere, so I assume it is not supported.

An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.

0人赞添加讨论(0) 举报

What is the deal about https when using lxml?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间