I am using lxml to parse HTML files from given URLs.
For example:
import lxml.html

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)
My code works well in most cases, namely the ones with http:// URLs. However, I found that for every https:// URL, lxml simply raises an IOError. Does anyone know the reason? And, if possible, how to fix this problem?
BTW, I want to stick with lxml rather than switch to BeautifulSoup, given that I've already got a quick finished programme.
I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though.

The lxml documentation has a sentence listing the sources it can parse from. I don't see HTTPS in that sentence anywhere, so I assume it is not supported.
An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
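The workaround described in the answers could be sketched roughly as follows. This is only a minimal sketch, not the original answer's code: the helper name parse_html_url and its opener parameter are made up for illustration, and the import falls back from Python 3's urllib.request to the urllib2 module named above.

```python
import lxml.html

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, the module named in the answers


def parse_html_url(url, opener=urlopen):
    """Fetch the page ourselves, then let lxml parse the downloaded bytes.

    parse_html_url and opener are hypothetical names for this sketch;
    opener only exists so a different fetcher can be swapped in.
    """
    response = opener(url)
    try:
        data = response.read()  # raw document, HTTPS handled by urlopen
    finally:
        response.close()
    # fromstring() parses an in-memory document and returns its root element
    return lxml.html.fromstring(data)
```

With this, htmltree = parse_html_url(link) would take the place of lxml.html.parse(link). Note that fromstring returns the root element rather than an ElementTree; if the rest of the programme expects the tree, passing the open response object to lxml.html.parse, which also accepts file-like objects, should be an alternative.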