I want to send a POST request with a file attached, but some of the field names contain Unicode characters. They aren't received correctly by the server, as shown below:
>>> # normal, without unicode
>>> resp = requests.post('http://httpbin.org/post', data={'snowman': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'snowman': u'hello'}
>>>
>>> # with unicode, see that the name has become 'null'
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'null': u'hello'}
>>>
>>> # it works without the image
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}).json()['form']
>>> resp
{u'\u2603': u'hello'}
How do I get around this problem?
From the Wireshark comments, it looks like python-requests is doing it wrong, but there might not be a "right answer".
RFC 2388 (Section 3) says that field names originally in non-ASCII character sets may be encoded within the value of the "name" parameter using the standard method described in RFC 2047.

RFC 2047, in turn, defines "encoded-words" of the form =?charset?encoding?encoded-text?= and goes on to describe "Q" and "B" encoding methods. Using the "Q" (quoted-printable) method, the name would be:

=?utf-8?q?=E2=98=83?=
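For reference, here is a minimal sketch (mine, not part of the original answer) showing where that Q-encoded word comes from; each UTF-8 byte of the snowman is written as =XX:

    # Build the RFC 2047 "Q" encoded-word for the snowman by hand
    # (illustration only for this single character; real Q-encoding has
    # extra rules, e.g. for spaces and underscores).
    name = u'\u2603'  # the snowman
    q = ''.join('=%02X' % b for b in name.encode('utf-8'))
    print('=?utf-8?q?%s?=' % q)  # =?utf-8?q?=E2=98=83?=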
BUT, as RFC 6266 points out, encoded-words must not be used in the parameters of a Content-Disposition field, so we're not allowed to do that. (Kudos to @Lukasa for this catch!)
RFC 2388 also says that a file name outside US-ASCII may be approximated or encoded using the method of RFC 2231. And RFC 2231 describes a method that looks more like what you're seeing: a parameter whose name ends in "*" carries a character set, an optional language tag, and a percent-encoded value. That is, if this method is employed (and supported on both ends), the name should be:

name*=utf-8''%E2%98%83
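For what it's worth, that extended value is exactly what Python's standard library produces; a tiny sketch of mine for illustration:

    from email.utils import encode_rfc2231

    # RFC 2231 extended parameter value for the snowman field name
    print(encode_rfc2231(u'\u2603', charset='utf-8'))  # utf-8''%E2%98%83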
Fortunately, RFC 5987 adds an encoding based on RFC 2231 to HTTP headers! (Kudos to @bobince for this find.) It says you can (and probably should) include both an RFC 2231-style value and a plain value. In their example, however, they "dumb down" the plain value for "legacy clients". This isn't really an option for a form-field name, so it seems like the best approach might be to include both name= and name*= versions, where the plain value is (as @bobince describes it) "just sending the bytes, quoted, in the same encoding as the form".
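As an illustration of that combined approach (my sketch of the idea, not what requests actually emits), the part header could be built like this:

    from email.utils import encode_rfc2231

    name = u'\u2603'  # the snowman
    # Plain value: just the bytes, quoted, in the form's encoding (UTF-8 here),
    # plus the RFC 2231/5987 extended value for clients that understand it.
    disposition = 'form-data; name="%s"; name*=%s' % (
        name, encode_rfc2231(name, charset='utf-8'))
    print('Content-Disposition: ' + disposition)
    # Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83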
Finally, see http://larry.masinter.net/1307multipart-form-data.pdf (also https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909#c8), wherein it is recommended to avoid the problem by sticking with ASCII form field names.
Two things here.
It doesn't for me; I get name*=utf-8''%E2%98%83. %5Cu2603 is what I would expect from accidentally typing a \u escape in a non-Unicode string, i.e. writing '\u2603' rather than '☃' as above.

As discussed at some length, name*=utf-8''%E2%98%83 is the RFC 2231 form of extended Unicode headers.
The RFC 2231 format was previously invalid in HTTP (HTTP is not a mail standard in the RFC 822 family). It has now been brought to HTTP by RFC 5987, but because that is pretty recent, almost nothing on the server side supports it.
Definitely urllib3 should not be relying on it; it should be doing what browsers do and just sending the bytes, quoted, in the same encoding as the form. If it must use the 2231 form, it should be used in combination with a plain value, as in section 4.2. E.g. in urllib3.fields.format_header_param, instead of falling back to the RFC 2231 encoding for non-ASCII names, you could simply quote the UTF-8 bytes, roughly as sketched below.
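At the time, the fallback in urllib3.fields.format_header_param behaved roughly like the first function below (a simplified sketch, not the verbatim source); the second shows the browser-like alternative:

    from email.utils import encode_rfc2231

    def format_rfc2231_style(name, value):
        # Simplified version of what urllib3 did: ASCII values are quoted
        # as-is, anything else falls back to the RFC 2231/5987 extended form.
        try:
            value.encode('ascii')
        except UnicodeEncodeError:
            return '%s*=%s' % (name, encode_rfc2231(value, charset='utf-8'))
        return '%s="%s"' % (name, value)

    def format_browser_style(name, value):
        # What browsers do: just send the bytes, quoted, in the form's encoding.
        return '%s="%s"' % (name, value)

    print(format_rfc2231_style('name', u'\u2603'))  # name*=utf-8''%E2%98%83
    print(format_browser_style('name', u'\u2603'))  # name="☃"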
However, including the 2231 form at all may still confuse some older servers.
I guess I'm the one to blame for the fact that urllib3 (and therefore Requests) produces the format it does. When I wrote that code, I had mostly file names in mind, and RFC 2388 section 4.5 suggests the use of that RFC 2231 format there.
With respect to field names, RFC 2388 section 3 refers to RFC 2047, which in turn forbids the use of encoded-words in Content-Disposition fields. So it seems to me and others that these two standards contradict one another. But perhaps RFC 2388 should take precedence, so perhaps using RFC 2047 encoded-words would be more correct.

Recently I've been made aware of the fact that the current draft of the HTML5 standard has a section on the encoding of multipart/form-data. It contradicts several other standards, but nevertheless it might be the future. With regard to field names (not file names), it describes an encoding which turns characters into decimal XML entities, e.g. &#9731; for your ☃ snowman. However, that encoding should only be applied if the encoding established for the submission does not contain the character in question, which should not be the case in your setup.

I've filed an issue for urllib3 to discuss the consequences of this, and probably address them in the implementation.
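For field names, that escaping would look roughly like the following (a rough sketch of my own, simplified to an ASCII-only target encoding; the real algorithm works against the form's chosen encoding):

    def escape_for_ascii_form(name):
        # Replace any character the target encoding cannot represent with a
        # decimal character reference, per the HTML5 multipart/form-data rules
        # (simplified here to an ASCII-only target encoding).
        return ''.join(c if ord(c) < 128 else '&#%d;' % ord(c) for c in name)

    print(escape_for_ascii_form(u'\u2603'))  # &#9731;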
Rob Starling's answer is very insightful and proves that using non-ASCII characters in field names is a bad idea compatibility-wise (all those RFCs!), but I managed to get python-requests to adhere to the most widely used (from what I can see) method of handling things.
Inside site-packages/requests/packages/urllib3/fields.py, delete this (line ~50):

    value = email.utils.encode_rfc2231(value, 'utf-8')

And change the line right underneath it to this:

    value = '%s="%s"' % (name, value)
This makes servers (that I've tested) pick up the field and process it correctly.
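If editing files under site-packages is not an option, roughly the same effect can be had by monkey-patching the helper at runtime. This is a sketch that assumes an older requests which bundles urllib3 as requests.packages.urllib3 and looks up format_header_param at module level:

    import requests
    from requests.packages.urllib3 import fields

    def format_header_param_plain(name, value):
        # Just send the bytes, quoted, in the same encoding as the form,
        # instead of the RFC 2231 "name*=" form.
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        return '%s="%s"' % (name, value)

    fields.format_header_param = format_header_param_plain

    resp = requests.post(
        'http://httpbin.org/post',
        data={u'\u2603': 'hello'},
        files={'image': ('kitten.jpg', open('kitten.jpg', 'rb'))},
    ).json()['form']
    print(resp)  # {'☃': 'hello'} if the server decodes the raw UTF-8 bytes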