I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986.
I get from the user a URL in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. And what I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.

    # url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
    match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                     r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
    if not match:
        raise BadURLException(url)
    protocol, domain, port, path, query = match.groups()

    try:
        domain = unicode(domain, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in domain
    domain = domain.encode('idna')

    if port is None:
        port = ''

    path = urllib.quote(path)

    if query is None:
        query = ''
    else:
        query = urllib.quote(query, safe='=&?/')

    url = protocol + '://' + domain + port + path + query
    # url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?
Edits: switched from urlparse/urlunparse to urlsplit/urlunsplit.
Okay, with these comments and some bug-fixing in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function -- returns a canonical, ASCII form of the URL:
There's some RFC 3986 URL parsing work underway (e.g. as part of the Summer of Code) but nothing in the standard library yet AFAIK -- and nothing much on the URI encoding side of things either, again AFAIK. So you might as well go with MizardX's elegant approach.
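For readers on Python 3, the general approach discussed in this thread (split the URL with urlsplit, IDNA-encode the host, percent-encode everything else) might look roughly like the sketch below. This is not MizardX's code, nor the canonurl() function mentioned above. The name to_ascii_url and the choice of "safe" characters are assumptions made for illustration, and it uses urllib.parse, which is where Python 3 keeps the old urlparse and urllib.quote functions.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: leave RFC 3986 reserved characters and existing "%" escapes alone.
_PATH_SAFE = "/%!$&'()*+,;=:@"
_QUERY_SAFE = _PATH_SAFE + "?"


def to_ascii_url(url):
    """Return an ASCII-only form of a Unicode URL (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("no host found in %r" % (url,))

    # IDNA-encode only the host name; user info and port are handled apart.
    # Note: .hostname is lower-cased, and IPv6 literals are not handled here.
    netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.username is not None:
        userinfo = quote(parts.username, safe="")
        if parts.password is not None:
            userinfo += ":" + quote(parts.password, safe="")
        netloc = userinfo + "@" + netloc
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    # quote() encodes non-ASCII text as UTF-8 before percent-escaping it.
    path = quote(parts.path, safe=_PATH_SAFE)
    query = quote(parts.query, safe=_QUERY_SAFE)
    fragment = quote(parts.fragment, safe=_QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("http://➡.ws/♥"))
```

Keeping "%" in the safe set means an already-escaped URL passes through unchanged, at the cost of not escaping a stray literal "%".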
The code given by MizardX isn't 100% correct. This example won't work:
example.com/folder/?page=2
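One plausible reason for the failure, assuming the code relies on urlsplit, is that a URL without a scheme and "//" is parsed with an empty network location, so the host ends up in the path and never gets IDNA-encoded. The sketch below shows this with Python 3's urllib.parse. The ensure_scheme helper and its "http" default are hypothetical, and its "://" check is deliberately naive.

```python
from urllib.parse import urlsplit

bare = urlsplit("example.com/folder/?page=2")
# Without "//", nothing is recognised as the host: it all lands in the path.
print(repr(bare.netloc), repr(bare.path), repr(bare.query))


def ensure_scheme(url, default="http"):
    """Prefix a default scheme when the URL has none (hypothetical helper)."""
    # Naive check: a "://" inside the query string would fool it.
    return url if "://" in url else default + "://" + url


fixed = urlsplit(ensure_scheme("example.com/folder/?page=2"))
print(repr(fixed.netloc), repr(fixed.path), repr(fixed.query))
```

Whether guessing a scheme is acceptable depends on where the URLs come from; rejecting scheme-less input is the stricter alternative.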
Check out django.utils.encoding.iri_to_uri() to convert Unicode URLs to ASCII URLs:
http://docs.djangoproject.com/en/dev/ref/unicode/
You might use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there. (You can access the domain and port separately by accessing the returned value's named properties, but as port syntax is always in ASCII it is unaffected by the IDNA encoding process.)
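As a small illustration of those named properties, here is a sketch using Python 3's urllib.parse.urlsplit (the Python 2 urlparse.urlsplit result exposes the same hostname and port attributes). The example URL and the port number 8080 are made up for the demonstration.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://➡.ws:8080/♥")

host = parts.hostname  # host name only, without the port
port = parts.port      # an int, or None when no port was given

# Only the host goes through IDNA; the port is appended back untouched.
ascii_host = host.encode("idna").decode("ascii")
netloc = ascii_host if port is None else "%s:%d" % (ascii_host, port)
print(netloc)
```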