multipart/form-data, what is the default charset f

2019-04-21 15:25发布

what is the default encoding one should use to decode multipart/form-data if no charset is given? RFC2388 states:

4.5 Charset of text in form data

Each part of a multipart/form-data is supposed to have a content- type. In the case where a field element is text, the charset parameter for the text indicates the character encoding used.

For example, a form with a text field in which a user typed 'Joe owes <eu>100' where <eu> is the Euro symbol might have form data returned as:

--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable>>

Joe owes =80100.
--AaB03x

In my case, the charset isn't set and I don't know how to decode the data within that text/plain section. As I do not want to enforce something that isn't standard behavior I'm asking what the expected behavior in this case is. The RFC does not seem to explain this so I'm kinda lost.

Thank you!

3条回答
Rolldiameter
2楼-- · 2019-04-21 15:55

The default charset for HTTP 1.1 is ISO-8859-1 (Latin1), I would guess that this also applies here.

3.7.1 Canonicalization and Text Defaults

--snip--

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

查看更多
Animai°情兽
3楼-- · 2019-04-21 16:07

Thanks to the detailed explanation by @owlman.

Just some more info here:

Upload request payload fragment:

------WebKitFormBoundarydZAwJIasnBbGaUqM
Content-Disposition: form-data; name="file"; filename="xxx.txt"
Content-Type: text/plain

If "xxx.txt" has some UNICODE char in it using UTF-8 encoding, Resin(as of 4.0.40) can't decode it correctly, but Jetty(9.x) can.

I think the reason for Resin's behavior is that the Content-type doesn't specify any encoding, so Resin decode file name using "ISO8859-1", which may result in garbled characters.

I did some googling:

https://mail-archives.apache.org/mod_mbox/struts-user/200310.mbox/%3C3FA0395B.1080209@kumachan.net.nz%3E

It seems that Resin's behavior is according to Servlet Spec 2.3

And I can't find any settings from http://www.caucho.com/resin-4.0/reference.xtp which can change this behavior for Resin.

查看更多
Anthone
4楼-- · 2019-04-21 16:17

This apparently has changed in HTML5 (see http://dev.w3.org/html5/spec-preview/constraints.html#multipart-form-data).

The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified.

So where is the character set specified? As far as I can tell from the encoding algorithm, the only place is within a form data set entry named _charset_.

If your form does not have a hidden input named _charset_, what happens? I've tested this in Chrome 28, sending a form encoded in UTF-8 and one in ISO-8859-1 and inspecting the sent headers and payload, and I don't see charset given anywhere (even though the text encoding definitely changes). If I include an empty _charset_ field in the form, Chrome populates that with the correct charset type. I guess any server-side code must look for that _charset_ field to figure it out?

I ran into this problem while writing a Chrome extension that uses XMLHttpRequest.send of a FormData object, which always gets encoded in UTF-8 no matter what the source document encoding is.

Let the request entity body be the result of running the multipart/form-data encoding algorithm with data as form data set and with utf-8 as the explicit character encoding.

Let mime type be the concatenation of "multipart/form-data;", a U+0020 SPACE character, "boundary=", and the multipart/form-data boundary string generated by the multipart/form-data encoding algorithm.

As I found earlier, charset=utf-8 is not specified anywhere in the POST request, unless you include an empty _charset_ field in the form, which in this case will automatically get populated with "utf-8".

This is my understanding of the state of things. I welcome any corrections to my assumptions!

查看更多
登录 后发表回答