What if the form-data boundary is contained in the

Posted 2020-06-08 13:47

Question:

Let's take the following example of multipart/form-data from w3.org:

Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Larry
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain

... contents of file1.txt ...
--AaB03x--

It's pretty straightforward, but let's say you are writing code that implements this and creates such a request from scratch. Let's assume file1.txt is created by a user, and we have no control over its contents.

What if the text file file1.txt contains the string --AaB03x? You likely generated the boundary AaB03x randomly, but let's assume a "million monkeys entering a million web forms" scenario.

Is there a standard way of dealing with this improbable but still possible situation?

Should the text/plain content (or, potentially, something like image/jpeg or application/octet-stream) be "encoded", or some of the information within it "escaped", in some way?

Or should the developer always search the contents of the file for the boundary, and keep picking a new randomly generated boundary until the chosen string cannot be found within the file?

Answer 1:

HTTP delegates to the MIME RFCs for defining the multipart/ types here. The rules are laid out in RFC 2046 section 5.1.

The RFC simply states the boundary must not appear:

The boundary delimiter MUST NOT appear inside any of the encapsulated parts, on a line by itself or as the prefix of any line. This implies that it is crucial that the composing agent be able to choose and specify a unique boundary parameter value that does not contain the boundary parameter value of an enclosing multipart as a prefix.

and

NOTE: Because boundary delimiters must not appear in the body parts being encapsulated, a user agent must exercise care to choose a unique boundary parameter value. The boundary parameter value in the example above could have been the result of an algorithm designed to produce boundary delimiters with a very low probability of already existing in the data to be encapsulated without having to prescan the data. Alternate algorithms might result in more "readable" boundary delimiters for a recipient with an old user agent, but would require more attention to the possibility that the boundary delimiter might appear at the beginning of some line in the encapsulated part. The simplest boundary delimiter line possible is something like "---", with a closing boundary delimiter line of "-----".

Most MIME software simply generates a random boundary long enough that the chance of it appearing in any of the parts is negligible: a collision could happen, but the probability is so low that it can be ignored in practice. UUID values rely on the same principle: if you generate a few tens of trillions of random (version 4) UUIDs in a year, the probability of producing two identical values is about the same as the often-quoted annual risk of a given person being hit by a meteorite, roughly 1 in 17 billion.
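The UUID comparison can be checked with the birthday bound. The following is a rough sketch, assuming version 4 UUIDs with 122 random bits and using the standard approximation p ≈ 1 - exp(-n(n-1) / 2N); the function name `collision_probability` is made up for this illustration.

```python
import math

# Assumption: version 4 UUIDs, where 122 of the 128 bits are random.
UUID4_RANDOM_BITS = 122


def collision_probability(n: int, bits: int = UUID4_RANDOM_BITS) -> float:
    """Approximate probability of at least one duplicate among n random values."""
    space = 2.0 ** bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the result is extremely small.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (10**12, 10**13, 25 * 10**12):
        print(f"{n:.1e} UUIDs -> p = {collision_probability(n):.2e}")
```

Under these assumptions, a probability near 1 in 17 billion corresponds to a count on the order of tens of trillions of UUIDs. The same arithmetic applies to a random boundary: the more random bits it carries, the smaller the chance of it matching anything in the data.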

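Note that in email, binary data is usually encoded with an ASCII-safe encoding such as base64. The standard base64 alphabet contains no dashes, so base64-encoded data can never contain a boundary delimiter line, which always begins with --. (Browsers typically send file contents in multipart/form-data requests unencoded, so this safeguard does not apply there.)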
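The base64 point is easy to check. Here is a small sketch using Python's standard `base64` module; the helper name `encode_part` is hypothetical, and the boundary is the one from the example in the question.

```python
import base64
import os


def encode_part(data: bytes) -> bytes:
    """Encode a body part as MIME-style base64 (76-character lines)."""
    return base64.encodebytes(data)


if __name__ == "__main__":
    delimiter = b"--AaB03x"  # "--" followed by the boundary from the example
    # Raw bytes that deliberately contain the delimiter on its own line.
    payload = os.urandom(1024) + b"\r\n" + delimiter + b"\r\n" + os.urandom(1024)
    encoded = encode_part(payload)

    print(delimiter in payload)   # the raw data contains the delimiter
    print(b"-" in encoded)        # the encoded data contains no dash at all
```

This holds for the standard alphabet only. The URL-safe variant (`base64.urlsafe_b64encode`) replaces `+` and `/` with `-` and `_`, so it does not give the same guarantee.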

As such, the standard way to deal with the possibility is to simply make the possibility so unlikely as to be next to nothing. If you have a greater chance of a computer storing the email being hit by a meteorite, why worry about the MIME boundary?
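For completeness, here is a minimal sketch of both approaches: generating a random boundary, and optionally prescanning the parts as the question suggests. The function names are invented for this example, and the 128-bit boundary length is an assumption rather than anything the RFC requires.

```python
import secrets
from typing import Sequence


def make_boundary() -> str:
    """Return a random boundary: 16 random bytes as 32 hex characters."""
    return secrets.token_hex(16)


def delimiter_in_part(boundary: str, part: bytes) -> bool:
    """True if "--" + boundary begins any line of the part.

    This mirrors the RFC 2046 wording ("on a line by itself or as the
    prefix of any line"). Checking for a preceding LF covers both CRLF
    and bare LF line endings.
    """
    delimiter = b"--" + boundary.encode("ascii")
    return part.startswith(delimiter) or (b"\n" + delimiter) in part


def choose_boundary(parts: Sequence[bytes], max_tries: int = 10) -> str:
    """Pick a random boundary, retrying if the prescan finds a collision."""
    for _ in range(max_tries):
        boundary = make_boundary()
        if not any(delimiter_in_part(boundary, p) for p in parts):
            return boundary
    raise RuntimeError("could not find a boundary absent from all parts")


if __name__ == "__main__":
    parts = [b"Larry", b"... contents of file1.txt ...\r\n--AaB03x\r\n"]
    print(delimiter_in_part("AaB03x", parts[1]))  # the example boundary collides
    print(choose_boundary(parts))                 # a fresh random boundary
```

With a boundary this long, the retry loop is essentially never taken. The prescan costs a full pass over the data, which is why most software skips it and relies on randomness alone.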