URL with national characters giving UnicodeEncodeE

2019-08-13 17:44发布

I'm trying to extract dictionary entry:

url = 'http://www.lingvo.ua/uk/Interpret/uk-ru/вікно'
# parsed_url = urlparse(url)
# parameters = parse_qs(parsed_url.query)
# url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text = xmldata.xpath(//div[@class="js-article-html g-card"])

either with commented lines on or off, it keeps getting an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-28: ordinal not in range(128)

标签： python character-encoding

1条回答

爱情/是我丢掉的垃圾

2楼-- · 2019-08-13 18:30

Your issue is that you have non-ASCII characters within your URL path which must be properly encoded using urllib.parse.quote(string) in Python 3 or urllib.quote(string) in Python 2.

# Python 3
import urllib.parse
url = 'http://www.lingvo.ua' + urllib.parse.quote('/uk/Interpret/uk-ru/вікно')

# Python 2
import urllib
url = 'http://www.lingvo.ua' + urllib.quote(u'/uk/Interpret/uk-ru/вікно'.encode('UTF-8'))

NOTE: According to What is the proper way to URL encode Unicode characters?, URLs should be encoded as UTF-8. However, that does not preclude percent encoding the resulting non-ASCII, UTF-8 characters.

0人赞添加讨论(0) 举报

URL with national characters giving UnicodeEncodeE

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间