Getting a raw string from a unicode string in Python

Posted 2020-03-31 06:23

Question:

I have a Unicode string that I'm retrieving from a web service in Python.

I need to access a URL I've parsed from this string, that includes various diacritics.

However, if I pass the unicode string to urllib2, it raises a unicode encoding error. The exact same string written as a "raw" string, r"some string", works properly.

How can I get the raw binary representation of a unicode string in Python, without converting it to the system locale?

I've been through the Python docs, and everything seems to come back to the codecs module. However, the documentation for the codecs module is sparse at best, and the whole thing seems to be extremely file-oriented.

I'm on Windows, in case it matters.

Answer 1:

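You need to encode the URL from unicode to a bytestring. u'' and r'' produce two different kinds of objects: a unicode string and a bytestring. (In Python 2 the r prefix only turns off backslash escapes; the literal is a bytestring because it has no u prefix.)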

You can encode a unicode string to a bytestring with the .encode() method, but you need to know which encoding to use. For URLs, UTF-8 is usually the right choice, but you also need to percent-escape the resulting bytes to fit the URL scheme:

import urlparse, urllib

# Split the URL, percent-encode the UTF-8 bytes of the path, and reassemble.
parts = list(urlparse.urlsplit(url))
parts[2] = urllib.quote(parts[2].encode('utf8'))
url = urlparse.urlunsplit(parts)

The above example is based on an educated guess that the problem you are facing is due to non-ASCII characters in the path part of the URL, but without further details from you it has to remain a guess.
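For anyone reading this on Python 3, here is a rough sketch of the same idea, again assuming the non-ASCII characters are in the path. In Python 3, urlparse and urllib.quote live in urllib.parse, and quote() accepts a str and encodes it as UTF-8 by default. The helper name quote_url_path and the example URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def quote_url_path(url):
    """Percent-encode the path of `url` as UTF-8 (hypothetical helper)."""
    parts = list(urlsplit(url))
    # parts[2] is the path; quote() leaves "/" alone and escapes the
    # UTF-8 bytes of every non-ASCII character.
    parts[2] = quote(parts[2])
    return urlunsplit(parts)


print(quote_url_path("http://example.com/caf\u00e9"))
# http://example.com/caf%C3%A9
```

Note that this sketch does not check whether the path is already percent-encoded, so running it twice on the same URL would escape the % signs again.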

For domain names, you need to apply the IDNA encoding (RFC 3490):

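import urlparse

parts = list(urlparse.urlsplit(url))
# parts[1] is the network location (the domain name); IDNA-encode it.
parts[1] = parts[1].encode('idna')
# Encode any remaining unicode components as UTF-8 bytestrings.
parts = [p.encode('utf8') if isinstance(p, unicode) else p for p in parts]
url = urlparse.urlunsplit(parts)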
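A Python 3 version of the domain-name step might look roughly like the sketch below. It relies on the built-in 'idna' codec, which implements the RFC 3490 flavour of IDNA. The helper name idna_encode_host and the example domain are invented for illustration, and the sketch assumes the URL has a hostname and ignores any user:password@ part.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url):
    """Return `url` with its hostname IDNA-encoded (hypothetical helper)."""
    parts = urlsplit(url)
    # The 'idna' codec returns ASCII bytes; decode them back to str.
    host = parts.hostname.encode("idna").decode("ascii")
    # Re-attach the port if there is one; userinfo is dropped in this sketch.
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(idna_encode_host("http://b\u00fccher.example/path"))
# http://xn--bcher-kva.example/path
```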

See the Python Unicode HOWTO for more information. I also strongly recommend you read the Joel on Software Unicode article as a good primer on the subject of encodings.