I want to send a POST request with a file attached, but some of the field names have Unicode characters in them. They aren't received correctly by the server, as seen below:
>>> # normal, without unicode
>>> resp = requests.post('http://httpbin.org/post', data={'snowman': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'snowman': u'hello'}
>>>
>>> # with unicode, see that the name has become 'null'
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'null': u'hello'}
>>>
>>> # it works without the image
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}).json()['form']
>>> resp
{u'\u2603': u'hello'}
How do I get around this problem?
From the Wireshark comments, it looks like python-requests is doing it wrong, but there might not be a "right answer".
RFC 2388 says
Field names originally in non-ASCII character sets may be encoded within the value of the "name" parameter using the standard method described in RFC 2047.
RFC 2047, in turn, says
Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.
and goes on to describe "Q" and "B" encoding methods. Using the "Q" (quoted-printable) method, the name would be:
=?utf-8?q?=E2=98=83?=
BUT, as RFC 6266 clearly states:
An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.
so we're not allowed to do that. (Kudos to @Lukasa for this catch!)
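Just for illustration (since, per RFC 6266, this form must not actually be used in Content-Disposition), the Q-encoded word above is easy to produce by hand. A minimal sketch — it encodes every byte as =XX, which is valid Q encoding, though real encoders pass printable ASCII through unchanged:

```python
def q_encode(text, charset='utf-8'):
    # RFC 2047 "Q" encoding: each byte becomes =XX (two uppercase hex digits),
    # wrapped in the =?charset?q?...?= encoded-word envelope
    encoded = ''.join('=%02X' % b for b in text.encode(charset))
    return '=?%s?q?%s?=' % (charset, encoded)

print(q_encode('☃'))  # =?utf-8?q?=E2=98=83?=
```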
RFC 2388 also says
The original local file name may be supplied as well, either as a
"filename" parameter either of the "content-disposition: form-data"
header or, in the case of multiple files, in a "content-disposition:
file" header of the subpart. The sending application MAY supply a
file name; if the file name of the sender's operating system is not
in US-ASCII, the file name might be approximated, or encoded using
the method of RFC 2231.
And RFC 2231 describes a method that looks more like what you're seeing. In it,
Asterisks ("*") are reused to provide the indicator that language and
character set information is present and encoding is being used. A
single quote ("'") is used to delimit the character set and language
information at the beginning of the parameter value. Percent signs
("%") are used as the encoding flag, which agrees with RFC 2047.
Specifically, an asterisk at the end of a parameter name acts as an
indicator that character set and language information may appear at
the beginning of the parameter value. A single quote is used to
separate the character set, language, and actual value information in
the parameter value string, and a percent sign is used to flag
octets encoded in hexadecimal.
That is, if this method is employed (and supported on both ends), the name should be:
name*=utf-8''%E2%98%83
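Python's standard library produces exactly this extended-value form; email.utils.encode_rfc2231 is, incidentally, the same helper urllib3 calls:

```python
import email.utils

# RFC 2231 extended-value encoding of the snowman field name:
# charset, then language (empty here), then percent-encoded octets
print(email.utils.encode_rfc2231('☃', 'utf-8'))  # utf-8''%E2%98%83
```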
Fortunately, RFC 5987 adds an encoding based on RFC 2231 to HTTP headers! (Kudos to @bobince for this find.) It says you can (and probably should) include both an RFC 2231-style value and a plain value:
Header field specifications need to define whether multiple instances
of parameters with identical parmname components are allowed, and how
they should be processed. This specification suggests that a
parameter using the extended syntax takes precedence. This would
allow producers to use both formats without breaking recipients that
do not understand the extended syntax yet.
Example:
foo: bar; title="EURO exchange rates";
title*=utf-8''%e2%82%ac%20exchange%20rates
In their example, however, they "dumb down" the plain value for "legacy clients". This isn't really an option for a form-field name, so it seems like the best approach might be to include both name= and name*= versions, where the plain value is (as @bobince describes it) "just sending the bytes, quoted, in the same encoding as the form", like:
Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83
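A sketch of a helper that emits both parameters this way (the function name is made up for illustration; it assumes the name is a Unicode string and the form encoding is UTF-8):

```python
import email.utils

def form_data_disposition(name):
    # Emit both the plain name= parameter (bytes as-is, quoted) and the
    # RFC 2231/5987 extended name*= parameter, so both kinds of
    # recipients have something they can parse
    extended = email.utils.encode_rfc2231(name, 'utf-8')
    return 'Content-Disposition: form-data; name="%s"; name*=%s' % (name, extended)

print(form_data_disposition('☃'))
# Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83
```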
See also:
- HTTP headers encoding/decoding in Java
- How can I encode a filename according to RFC 2231?
- How to encode the filename parameter of Content-Disposition header in HTTP?
Finally, see http://larry.masinter.net/1307multipart-form-data.pdf (also https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909#c8 ), wherein it is recommended to avoid the problem by sticking with ASCII form field names.
The field value appears as form-data;name*=utf-8''%5Cu2603 in Wireshark
Two things here.
It doesn't for me; I get name*=utf-8''%E2%98%83. %5Cu2603 is what I would expect from accidentally typing a \u escape in a non-Unicode string, i.e. writing '\u2603' rather than '☃' as above.
As discussed at some length, this is the RFC 2231 form of extended Unicode headers:
The RFC 2231 format was previously invalid in HTTP (HTTP is not a mail standard in the RFC 822 family). It has now been brought to HTTP by RFC 5987, but because that is pretty recent, almost nothing on the server side supports it.
urllib3 definitely should not be relying on it; it should be doing what browsers do and just send the bytes, quoted, in the same encoding as the form. If it must use the RFC 2231 form, it should be used in combination with a plain value, as in section 4.2.
e.g. in urllib3.fields.format_header_param, instead of:
value = email.utils.encode_rfc2231(value, 'utf-8')
You could say:
value = '%s="%s"; %s*=%s' % (
name, value, name,
email.utils.encode_rfc2231(value, 'utf-8')
)
However, including the 2231 form at all may still confuse some older servers.
I guess I'm the one to blame for the fact that urllib3 and therefore Requests produces the format it does. When I wrote that code, I had mostly file names in mind, and RFC 2388 section 4.5 suggests the use of that RFC 2231 format there.
With respect to field names, RFC 2388 section 3 refers to RFC 2047, which in turn forbids the use of encoded words in Content-Disposition fields. So it seems to me and others that these two standards contradict one another. But perhaps RFC 2388 should take precedence, in which case using RFC 2047 encoded words would be more correct.
Recently I've been made aware of the fact that the current draft of the HTML 5 standard has a section on the encoding of multipart/form-data. It contradicts several other standards, but nevertheless it might be the future. With regard to field names (not file names), it describes an encoding which turns characters into decimal XML entities, e.g. &#9731; for your ☃ snowman. However, that encoding should only be applied if the encoding established for the submission does not contain the character in question, which should not be the case in your setup.
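Incidentally, Python's xmlcharrefreplace error handler implements exactly this decimal-entity fallback for characters outside the target charset:

```python
# Characters that don't fit the target charset (here: ASCII) are
# replaced by decimal character references, as the HTML 5 draft describes
print('☃'.encode('ascii', 'xmlcharrefreplace'))  # b'&#9731;'
```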
I've filed an issue for urllib3 to discuss the consequences of this, and probably address them in the implementation.
Rob Starling's answer is very insightful and proves that using non-ASCII characters in field names is a bad idea compatibility-wise (all those RFCs!), but I managed to get python-requests to adhere to the most used (from what I can see) method of handling things.
Inside site-packages/requests/packages/urllib3/fields.py, delete this (line ~50):
value = email.utils.encode_rfc2231(value, 'utf-8')
And change the line right underneath it to this:
value = '%s="%s"' % (name, value.decode('utf-8'))
This makes servers (that I've tested) pick up the field and process it correctly.
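If you'd rather not edit files under site-packages, a replacement function with the same behaviour can be monkey-patched in at runtime. A minimal sketch — the vendored module path in the comment matches requests versions of that era and may differ in yours:

```python
def format_header_param(name, value):
    # Plain-bytes variant: quote the value directly instead of applying
    # the RFC 2231 encoding (mirrors the manual edit described above)
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    return '%s="%s"' % (name, value)

# To install it (assumes the old vendored layout):
# from requests.packages.urllib3 import fields
# fields.format_header_param = format_header_param
```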