After studying the HTTP/1.1 standard, specifically page 31 and the related sections, I came to the conclusion that any 8-bit octet can be present in an HTTP header value, i.e. any character with a code in the [0,255] range.
And yet the HTTP servers I tried refuse to accept anything with a code > 127 (or most US-ASCII non-printable characters).
Here is a condensed excerpt of the grammar used in the standard:
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value and consisting of
either *TEXT or combinations of token, separators, and
quoted-string>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
OCTET = <any 8-bit sequence of data>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
TEXT = <any OCTET except CTLs, but including LWS>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\"
| <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
quoted-pair = "\" CHAR
As you can see, field-content can be a quoted-string, which is a quoted sequence of TEXT (i.e. any 8-bit octet except " and values in the [0-8, 11-12, 14-31, 127] range) or quoted-pair (\ followed by any value in the [0, 127] range). In other words, any 8-bit character sequence can be passed by enclosing it in quotes and prefixing special symbols with \.
(Note that the standard doesn't treat the NUL (0x00) character in any special way.)
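To make that reading concrete, here is a minimal Python sketch of how an arbitrary octet sequence would be wrapped in a quoted-string if the grammar is taken literally. The names `is_text` and `quote_octets` are made up for illustration, and the code follows my interpretation of the excerpt, not what any server actually accepts.

```python
# Octet classes taken from the RFC 2616 excerpt above.
CTL = set(range(0, 32)) | {127}
LWS_OCTETS = {9, 10, 13, 32}  # HT, LF, CR, SP: the octets LWS is built from

def is_text(octet: int) -> bool:
    """TEXT = any OCTET except CTLs, but including LWS (literal reading)."""
    return octet not in CTL or octet in LWS_OCTETS

def quote_octets(raw: bytes) -> bytes:
    """Wrap raw octets in a quoted-string, escaping where the grammar needs it.

    Assumption: CR and LF are passed through as TEXT, as in the reading above;
    strictly they only make sense as part of an LWS fold.
    """
    out = bytearray(b'"')
    for octet in raw:
        # '"' ends the string, '\' starts a quoted-pair, and the remaining
        # CTLs are not TEXT, so all of these go through quoted-pair.
        # Octets above 127 are TEXT and are copied unchanged.
        if octet in (0x22, 0x5C) or not is_text(octet):
            out += b"\\"
        out.append(octet)
    out += b'"'
    return bytes(out)

print(quote_octets("h\u00e9llo \"world\"".encode("utf-8")))
```

Under this reading a UTF-8 sequence passes through untouched, which is exactly what the servers I tried do not allow.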
But obviously either all the servers I tried are non-conforming, or the standard has changed since 1999, or I can't read it properly.
So... which characters are allowed in HTTP header values and why?
P.S. The reason behind all of this: I am looking for a way to pass a UTF-8-encoded sequence in an HTTP header value (without additional encoding, if possible).
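It looks as if there is an error in the HTTP/1.1 specs. As you pointed out, §4.2 describes the field content as OCTET:
field-content = <the OCTETs making up the field-value and consisting of
either *TEXT or combinations of token, separators, and
quoted-string>
And OCTET is defined in §2.2 as:
OCTET = <any 8-bit sequence of data>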
These lines are the basis of your conclusion that octets > 127 should be allowed, and certainly I see how you have drawn that conclusion. The mention of OCTET in §4.2 is the misleading error; it should be CHAR.
If you read §4.2 (Message Headers) from the beginning, you will note the following guidance:
If we do as instructed and go to RFC 822, specifically §3.1.2 (Structure of header fields), we learn the following:
So while HTTP/1.1 was written in 1999, they used a definition from 1982 to describe the field contents. In 1982, characters 0-127 were called "ASCII" and 128-255 were called "Extended ASCII". Now, in this answer I am not going to get involved in the food fight that gets evoked when using the term "Extended ASCII". I will simply point you to §3.3 of RFC 822 for the definition of what was then considered "any ASCII character":
And so there you have it - the smoking gun. "ASCII" stopped at 127 in 1982. The written paragraph portion of RFC 2616 §4.2 points you in the right direction, and the unfortunate later misuse of the token OCTET in that same section led you down this rabbit hole.
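As a rough illustration of the 0-127 reading, the sketch below checks whether a raw header value stays within US-ASCII and avoids control characters other than horizontal tab. The function name `is_us_ascii_field_value` is hypothetical, and the exact set of octets a given server rejects is an assumption; real servers differ in the details.

```python
def is_us_ascii_field_value(raw: bytes) -> bool:
    """True if every octet is HT, SP, or a printable US-ASCII character (33-126)."""
    return all(octet == 9 or 32 <= octet <= 126 for octet in raw)

# A plain ASCII value passes; the same text with a non-ASCII character fails
# whether it is encoded as UTF-8 or as ISO-8859-1, because both use octets > 127.
print(is_us_ascii_field_value(b"text/html; charset=utf-8"))
print(is_us_ascii_field_value("caf\u00e9".encode("utf-8")))
print(is_us_ascii_field_value("caf\u00e9".encode("latin-1")))
```

A check like this matches the behavior described in the question: anything above 127, and most non-printable characters, gets refused.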
RFC 2616 is obsolete (see https://www.rfc-editor.org/info/rfc2616); the relevant part has been replaced by RFC 7230 (see https://www.greenbytes.de/tech/webdav/rfc7230.html#rfc.section.A.2.p.9):
In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as the one defined in RFC 8187, or plain URI percent-encoding).
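For the UTF-8 use case in the question, a minimal sketch of both escaping options using Python's standard `urllib.parse` might look like this. The helper names are invented for illustration, and the receiving side has to agree to decode the value; note also that the RFC 8187 form is defined for header field parameters (such as `filename*`), not for arbitrary field values.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str) -> str:
    """Percent-encode the UTF-8 bytes of text so the result is pure US-ASCII."""
    return quote(text, safe="", encoding="utf-8")

def percent_decode(value: str) -> str:
    """Reverse percent_encode, failing loudly on invalid UTF-8."""
    return unquote(value, encoding="utf-8", errors="strict")

def rfc8187_ext_value(text: str) -> str:
    """Build an ext-value in the charset'language'value form with an empty language tag."""
    return "UTF-8''" + percent_encode(text)

encoded = percent_encode("na\u00efve \u20ac")
print(encoded)                                 # na%C3%AFve%20%E2%82%AC
print(percent_decode(encoded))
print(rfc8187_ext_value("na\u00efve \u20ac"))  # UTF-8''na%C3%AFve%20%E2%82%AC
```

Passing `safe=""` encodes more characters than RFC 8187 strictly requires, which keeps the sketch simple while still producing only characters the ext-value syntax allows.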