Trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it.
import urllib.request
url = 'http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)  # this call raises the error
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
This time the URL contains Ukrainian letters. But when I use a URL without any Ukrainian letters in it, as here:
import urllib.request
import lxml.html
url = "http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')
it gets me the data in Ukrainian and everything works just fine.
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function converts a dictionary (or a sequence of key-value pairs) into a query string for use in URLs. It first encodes the parameters to UTF-8 bytes, then converts those bytes to percent-encoded sequences.
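A minimal sketch of that approach, using the URL from the question. The variable names (base_url, params, query) are only illustrative:

```python
from urllib.parse import urlencode

base_url = 'http://sum.in.ua/'
params = {'swrd': 'автор'}

# urlencode() encodes each value to UTF-8 bytes, then percent-encodes them.
query = urlencode(params)
url = base_url + '?' + query

print(url)  # http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80
```

The resulting url contains only ASCII characters, so it can be passed to urllib.request.urlopen() as in the question.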
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string with urllib.parse.urlparse(), parse it into a dictionary with urllib.parse.parse_qs(), then pass that dictionary to urlencode() and put the re-encoded query string back into the URL. This splits the URL into its constituent parts, parses out the query string, re-encodes it, and rebuilds the URL afterwards.
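One way this repair could look is sketched below. The function name fix_url is hypothetical, and the sketch only re-encodes the query string; it assumes the host and path are already plain ASCII:

```python
from urllib.parse import urlparse, parse_qs, urlencode

def fix_url(url):
    """Re-encode the query string of url (fix_url is an illustrative name)."""
    parsed = urlparse(url)
    # parse_qs() returns a dict that maps each key to a list of values.
    params = parse_qs(parsed.query)
    # doseq=True turns those lists back into repeated key=value pairs.
    fixed_query = urlencode(params, doseq=True)
    # The parse result is a namedtuple, so _replace() swaps in the new query.
    return parsed._replace(query=fixed_query).geturl()

fixed = fix_url('http://sum.in.ua/?swrd=автор')
print(fixed)  # http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80
```

Note that parse_qs() drops parameters with blank values unless keep_blank_values=True is passed, so a URL that relies on empty parameters would need that flag.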
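I believe you can percent-encode the value yourself. I think urllib.parse.quote (urllib.quote in Python 2) will transform "автор" into "%D0%B0%D0%B2%D1%82%D0%BE%D1%80", so the query string becomes "swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80", which should be accepted just fine. Quote only the value: with its default arguments, quote would also escape the "=" as "%3D".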
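A short sketch of the quote approach in Python 3, assuming the same URL as in the question. The variable names are illustrative:

```python
from urllib.parse import quote

word = 'автор'
# quote() encodes the text to UTF-8 and percent-encodes the bytes.
encoded = quote(word)
url = 'http://sum.in.ua/?swrd=' + encoded

# Quoting the whole "key=value" pair escapes "=" too, unless it is marked safe.
whole_default = quote('swrd=автор')
whole_safe = quote('swrd=автор', safe='=')

print(url)            # http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80
print(whole_default)  # swrd%3D%D0%B0%D0%B2%D1%82%D0%BE%D1%80
print(whole_safe)     # swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80
```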