(I am asking this question (and answering it), to make accessible some (hopefully useful) information, since I could not find this readily using search engines. However, feel free to answer it and add useful information :-).)
How can HTTP headers be escaped/quoted in Python?
And/Or how can they be validated to make sure they do not contain any context-escaping values?
In other words, how can we do for HTTP headers what the cgi.escape and urllib.quote methods (and sanitizing) do for HTML and URLs? This can be used to guard against HTTP header injection and similar exploits.
For example...
Suppose the user provides a URL to which they should be redirected. We want to protect against injection attacks (of which SQL injection is a well-known one). Setting aside, for this discussion, the security concerns around automatically forwarding to a URL in a domain the user can choose, if we decide to redirect using the Location: header, how can we escape the user-supplied URL to prevent HTTP header injection (or detect whether it contains values dangerous for HTTP)?
# on a "posix sh"-like command-line...
# ...(it contains a malicious HTTP value)
$ redirect_to 'http://example.com'"\r\n"'Set-Cookie: malicious=value'
Now, in our Python code implementing the redirect_to command, we want input like the above either to be escaped (rendering it harmless) or to raise an error. How can we do so?
Don't escape. Just stop processing (drop the header or the whole request).
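A minimal sketch of that "reject, don't escape" advice, assuming the value arrives as a Python string. The names UnsafeHeaderValue, require_safe_header_value and location_header are hypothetical helpers made up for this illustration, not part of any library:

```python
import re

# Assumption for this sketch: any control character other than horizontal tab
# is treated as unsafe in a header value (this covers \r, \n and NUL).
_FORBIDDEN = re.compile(r'[\x00-\x08\x0a-\x1f\x7f]')


class UnsafeHeaderValue(ValueError):
    """Raised when a value must not be placed in an HTTP header."""


def require_safe_header_value(value):
    """Return value unchanged, or raise if it contains control characters."""
    if _FORBIDDEN.search(value):
        raise UnsafeHeaderValue("control character in header value: %r" % (value,))
    return value


def location_header(url):
    """Build a Location header line only if the URL passes the check."""
    return "Location: " + require_safe_header_value(url)
```

The caller can catch UnsafeHeaderValue and drop the header or abort the request, rather than trying to repair the input.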
If the input data is being included in a header field parameter (for example, the filename parameter of the Content-Disposition header), it can be encoded with email.utils.encode_rfc2231 (as constrained by these specifications, which define a variation of the rfc2231 encoding).
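As a rough sketch of the parameter case, the following builds a Content-Disposition line using the extended filename* parameter. The function name content_disposition is a hypothetical helper for illustration, and choosing UTF-8 as the charset is an assumption:

```python
import email.utils


def content_disposition(filename):
    """Build a Content-Disposition line with a percent-encoded filename.

    encode_rfc2231 percent-encodes everything outside the unreserved set,
    so \r and \n in the input become %0D and %0A rather than line breaks.
    """
    encoded = email.utils.encode_rfc2231(filename, 'UTF-8')
    return "Content-Disposition: attachment; filename*=" + encoded
```

For example, content_disposition("a\r\nb.txt") yields a single line ending in filename*=UTF-8''a%0D%0Ab.txt, so the line break can no longer terminate the header.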
If it is not being included in a header field parameter, then it seems that this method cannot be used. In that situation, the safest bet is likely to just not include the input, as Julian Reschke wrote; however, if you insist on including the input, you may want to try one of the following methods (which may be insecure: HTTP is not a MIME-compliant protocol, so unless the MIME-Version header is used (and possibly even if it is used?), these approaches may not work correctly for HTTP).
One way...
to do this is to use the email.header module, although it is not foolproof when used by itself (edit: it accepts \r\n\r\n, which terminates the headers and starts the body! Therefore \r and \n need to be handled unless they are followed by non-\r/\n whitespace, such as tabs or spaces). This module is designed specifically for rfc822 headers (edit: but, seemingly, not for HTTP headers, since the email package used to be several separate modules (example)), so it would seem to be the tool for the job. Its Header class is meant for encoding header values, not the full Header-Name: value line, and so is a candidate for this job (where we want to validate or escape the value only).
(Hint: many of the tools in the email module are also handy when working with other MIME-format (edit: and possibly also MIME-like) data; so are the tools in the cgi module, cgi.FieldStorage in particular for HTTP form parsing.)
However, email.header will only raise an error if the input seems malicious (that is, seems to contain another, embedded header); it does not, it seems, handle invalid input by escaping it (please correct this in the comments if it is not so). (The charset parameter should escape the header fragment and return a valid value; however, it may not be well supported by user agents (email, HTTP, etc.); see here. Edit: many HTTP user agents support the rfc5987 encoding, but not necessarily the encoding selected by the charset parameter of the email.header.Header class, which seems to use some MIME-specific encodings besides the rfc2231 encoding.)
Example:
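import email.header
import re

def check_string_for_rfc822_header(s):
    # Header.encode() raises email.errors.HeaderParseError when the value
    # appears to contain an embedded header. (In Python 2, str(Header) is a
    # synonym for encode(); in Python 3 it is not, so call encode() explicitly.)
    wip_header_component = email.header.Header(s).encode()
    # Also reject any line break that is not followed by a space or tab,
    # which covers \r\n\r\n (end of headers).
    if re.search(r'(\r?\n[\S\n\r]|\r[\S\r])', wip_header_component):
        raise ValueError("unfolded line break in header value")
    return wip_header_component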
# testing...
>>> check_string_for_rfc822_header("aaa")
"aaa"
>>> check_string_for_rfc822_header("a\r\nb")
<error>  # the regex check rejects a line break followed by non-whitespace
>>> check_string_for_rfc822_header("a\r\nb: c")
<error>
Another way...
to do this, it seems, would be to simply remove \r and \n characters (each separately, however; do not just remove occurrences of the full string \r\n, since this would still leave them unescaped when they occur separately, and many (most?) HTTP utilities will accept each of them on its own!). Similarly, we can escape the header by replacing \r\n, \r, and \n with themselves followed by whitespace (which is the way to escape a line break in a header; see the standard).
However, this method does not take into account the details of the standards (for example, rfc822 headers must be ASCII), which may be exploitable on their own.
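The folding-style escape mentioned above (a line break followed by whitespace) might be sketched as follows. The name fold_linebreaks is a hypothetical helper, and using a single space as the continuation whitespace is an assumption. Note that RFC 7230 deprecates this kind of line folding for HTTP, so this is shown only to illustrate the idea and removal or rejection is likely the safer choice:

```python
import re

# Match \r\n first so that it is treated as one line break, then lone \r or \n.
_LINEBREAK = re.compile(r'\r\n|\r|\n')


def fold_linebreaks(s):
    """Follow every line break with a space so that it reads as a
    continuation line rather than as the start of a new header."""
    return _LINEBREAK.sub(lambda m: m.group(0) + ' ', s)
```

With this sketch, an input such as "a\r\nb: c" becomes "a\r\n b: c", so a parser that honours folding would treat "b: c" as part of the same value.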
Example:
def remove_linebreakers(s):
    # remove \n and \r separately, not just the pair \r\n
    return s.replace("\n", "").replace("\r", "")

# or...
import re

def remove_linebreakers(s):
    return re.sub(r'[\n\r]', '', s)
# testing...
>>> remove_linebreakers("aaa")
"aaa"
>>> remove_linebreakers("a\r\nb")
"ab"
>>> remove_linebreakers("a\r\nb: c")
"ab: c"
In summary...
the first way seems better, but only for validation (not for escaping), unless the input is a parameter value, in which case escape it using email.utils.encode_rfc2231.
Example:
import email.header
import email.utils
import re

# if we are not working with a header param value, the following...
# ...raises email.errors.HeaderParseError if input is poisonous when in a header
wip_header_component = email.header.Header('<input>').encode()
# raise_error() is a placeholder for whatever error handling you use
header_component = (raise_error() if re.search(r'(\r?\n[\S\n\r]|\r[\S\r])', wip_header_component) else wip_header_component)

# ...or if we *are* working with a header param value...
email.utils.encode_rfc2231('<input>', 'UTF-8')