I've implemented a Pivotal Tracker API module in Python 2.7. The Pivotal Tracker API expects POST data to be an XML document and "application/xml" to be the content type.
My code uses urlib/httplib to post the document as shown:
request = urllib2.Request(self.url, xml_request.toxml('utf-8') if xml_request else None, self.headers)
obj = parse_xml(self.opener.open(request))
This yields an exception when the XML text contains non-ASCII characters:
File "/usr/lib/python2.7/httplib.py", line 951, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 809, in _send_output
msg += message_body
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 89: ordinal not in range(128)
As near as I can see, httplib._send_output is creating an ASCII string for the message payload, presumably because it expects the data to be URL encoded (application/x-www-form-urlencoded). It works fine with application/xml as long as only ASCII characters are used.
Is there a straightforward way to post application/xml data containing non-ASCII characters or am I going to have to jump through hoops (e.g. using Twistd and a custom producer for the POST payload)?
You're mixing Unicode and bytestrings.
To fix it, make sure that
self.headers
content is properly encoded i.e., all keys, values in theheaders
should be bytestrings:Note: character encoding of the headers has nothing to do with a character encoding of a body i.e., xml text can be encoded independently (it is just an octet stream from http message's point of view).
The same goes for
self.url
—if it has theunicode
type; convert it to a bytestring (using 'ascii' character encoding).HTTP message consists of a start-line, "headers", an empty line and possibly a message-body so
self.headers
is used for headers,self.url
is used for start-line (http method goes here) and probably forHost
http header (if client is http/1.1), XML text goes to message body (as binary blob).It is always safe to use ASCII encoding for
self.url
(IDNA can be used for non-ascii domain names—the result is also ASCII).Here's what rfc 7230 says about http headers character encoding:
To convert XML to a bytestring, see
application/xml
encoding condsiderations:There are 3 things to be covered here
The simple solution is to strictly encoding both header and body to utf-8 before sending out.
Check if the
self.url
is unicode. If it is unicode, thenhttplib
will treat the data as unicode.you could force encode self.url to unicode, then httplib will treat all data as unicode
Same as JF Sebastian answer, but I'm adding a new one so the code formatting works (and is more google-able)
Here's what happens if you're trying to tag on to the end of a mechanize form request: