I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant, but I have no way to change it.
What is the way to access a resource pointed to by a URL containing non-ASCII characters using Python?
Edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the Unicode characters, and urllib.parse.urlunsplit to join it back together.

For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
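Going back to the urllib.parse route: the split, quote, and rejoin steps can be sketched roughly as below. This is only an illustration, not code from the answer itself. The function name quote_url_path and the choice of `safe` characters are assumptions, and the hostname is left untouched, so it only suits URLs whose host is already ASCII (as in the question).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode non-ASCII characters after the host part of a URL.

    Hypothetical helper for illustration; the hostname is not modified.
    """
    parts = urlsplit(url)
    # quote() encodes to UTF-8 by default. "%" is kept in `safe` so that
    # sequences which are already percent-encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    # The resulting string is plain ASCII and could be handed to
    # urllib.request.urlopen.
    print(quote_url_path("http://example.org/Ñöñ-ÅŞÇİİ/"))
```

Keeping "%" in the safe set is a design choice: it makes the helper safe to call on a URL that is already partly encoded, at the cost of not escaping a literal percent sign.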
For example, http://bücher.ch can be passed to requests as-is.

Strictly speaking, URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
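A minimal Python 3 sketch of those two rules, offered as an illustration and not as the answer's original code. The name iri_to_uri and the `safe` character sets are assumptions. It applies IDNA to the whole network-location part, so it is only appropriate when the address has no user:pass@ prefix or :port suffix.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper)."""
    parts = urlsplit(iri)
    # Rule 1: the hostname is encoded with the Punycode-based IDNA codec.
    # Simplification: the entire netloc is treated as the hostname.
    netloc = parts.netloc.encode("idna").decode("ascii")
    # Rule 2: everything else is UTF-8 encoded and then %-encoded.
    # quote() uses UTF-8 by default; "%" stays safe to avoid double encoding.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://bücher.ch/Ñöñ/"))
```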
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)