RFC 2617 says to encode the username and password with Base64, but doesn't say what character encoding to use when creating the octets that are fed into the Base64 algorithm.
Should I assume US-ASCII or UTF-8? Or has this question already been settled somewhere?
Original spec - RFC 2617
RFC 2617 can be read as "ISO-8859-1" or "undefined". Your choice. It's known that many servers use ISO-8859-1 (like it or not) and will fail when you send something else. So probably the only safe choice is to stick to ASCII.
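For illustration, a client that wants to stay on the safe side with an RFC 2617-era server could simply reject credentials that are not pure ASCII before Base64-encoding them. A minimal sketch (the class and method names are my own):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class LegacyBasicAuth {
    // Hypothetical helper: builds a Basic credentials value for an RFC 2617-era server,
    // rejecting anything outside US-ASCII because the wire encoding is effectively undefined.
    static String asciiOnlyAuthorization(String username, String password) {
        String userPass = username + ":" + password;
        if (!StandardCharsets.US_ASCII.newEncoder().canEncode(userPass)) {
            throw new IllegalArgumentException(
                    "non-ASCII credentials are not reliably interoperable with RFC 2617 servers");
        }
        return "Basic " + Base64.getEncoder()
                .encodeToString(userPass.getBytes(StandardCharsets.US_ASCII));
    }
}
```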
For more information and a proposal to fix the situation, see the draft "An Encoding Parameter for HTTP Basic Authentication" (which formed the basis for RFC 7617).
New - RFC 7617
Since 2015 there is RFC 7617, which obsoletes RFC 2617. In contrast to the old RFC, the new RFC explicitly defines the character encoding to be used for username and password.
The server can optionally send the additional parameter charset="UTF-8" in its challenge, like this:
WWW-Authenticate: Basic realm="myChosenRealm", charset="UTF-8"
This announces that the server will accept non-ASCII characters in the username / password, and that it expects them to be encoded in UTF-8 (specifically Normalization Form C). Note that only UTF-8 is allowed.
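As an illustration of the client side, here is a minimal sketch (the class and method names are mine) of building the Authorization header the way RFC 7617 describes it when the server has advertised charset="UTF-8": normalize to NFC, encode as UTF-8, then Base64:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Base64;

final class Rfc7617BasicAuth {
    // Only appropriate if the server announced charset="UTF-8" in its challenge.
    static String authorizationHeader(String username, String password) {
        // RFC 7617 expects Normalization Form C before converting to UTF-8 octets.
        String userPass = Normalizer.normalize(username + ":" + password, Normalizer.Form.NFC);
        byte[] octets = userPass.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(octets);
    }

    public static void main(String[] args) {
        // "test" / "täst" -> Basic dGVzdDp0w6RzdA==
        System.out.println(authorizationHeader("test", "täst"));
    }
}
```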
Complete version:
Read the spec. It contains additional details, such as the exact encoding procedure and the list of Unicode code points that should be supported.
Browser support
As of 2018, modern browsers will usually default to UTF-8 if a user enters non-ASCII characters for the username or password (even if the server does not use the charset parameter).
Realm
The realm parameter still only supports ASCII characters even in RFC 7617.
RFCs aside, in the Spring Framework the BasicAuthenticationFilter class defaults to UTF-8.
The reason for this choice, I believe, is that UTF-8 is capable of encoding all possible characters, while ISO-8859-1 (or ASCII) is not. Trying to use a username/password with characters that the system does not support can lead to broken behaviour or (maybe worse) degraded security.
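For comparison, here is a plain-Java sketch (not Spring's actual code; the names are mine) of what decoding the header with UTF-8 on the server side roughly amounts to:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicCredentialsDecoder {
    // Hypothetical helper: parses "Basic <base64>" and decodes the octets as UTF-8,
    // mirroring the UTF-8 default described above (error handling kept minimal).
    static String[] decode(String authorizationHeader) {
        if (authorizationHeader == null
                || !authorizationHeader.regionMatches(true, 0, "Basic ", 0, 6)) {
            throw new IllegalArgumentException("not a Basic Authorization header");
        }
        byte[] octets = Base64.getDecoder().decode(authorizationHeader.substring(6).trim());
        String userPass = new String(octets, StandardCharsets.UTF_8);
        int colon = userPass.indexOf(':');   // the username itself must not contain ':'
        if (colon < 0) {
            throw new IllegalArgumentException("malformed user-pass string");
        }
        return new String[] { userPass.substring(0, colon), userPass.substring(colon + 1) };
    }
}
```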
Short answer: iso-8859-1 unless encoded-words are used in accordance with RFC2047 (MIME).
Longer explanation:
RFC2617, section 2 (HTTP Authentication) defines basic-credentials: user-pass is userid ":" password, where userid is *<TEXT excluding ":"> and password is *TEXT, and basic-credentials is the Base64 encoding of that user-pass string.
The spec should not be read without referring to RFC2616 (HTTP 1.1) for the definitions used in BNF rules like the one above.
RFC2616, section 2.2 defines TEXT, and states that words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC2047.
So it's definitely iso-8859-1 unless you detect some other encoding according to the RFC2047 (MIME pt. 3) rules for encoded-words, i.e. sequences of the form =?charset?encoding?encoded-text?=.
In this case, the euro sign in an encoded-word like =?iso-8859-15?q?T€ST?= would be encoded as the octet 0xA4, according to iso-8859-15. It is my understanding that you should check for these encoded-word delimiters and then decode the words inside based on the specified encoding. If you don't, you will think the password is =?iso-8859-15?q?T¤ST?= (notice that 0xA4 would be decoded to ¤ when interpreted as iso-8859-1).
This is my understanding, I can't find more explicit confirmation than these RFCs. And some of it seems contradictory. For example, one of the 4 stated goals of RFC2047 (MIME, pt. 3) is to redefine the message format to allow for textual header information in character sets other than US-ASCII.
But then RFC2616 (HTTP 1.1) defines a header using the TEXT rule, which defaults to iso-8859-1. Does that mean that every word in this header should be an encoded-word (i.e. the =?...?= form)?
Also relevant: no current browser does this. They use utf-8 (Chrome, Opera), iso-8859-1 (Safari), the system code page (IE), or something else (like only the most significant bit from utf-8 in the case of Firefox).
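To make the encoded-word handling described above concrete, here is a rough sketch of detecting and decoding a Q-encoded word such as =?iso-8859-15?q?T=A4ST?= (the properly escaped form of the example above). This is my own simplified decoder; it ignores the B encoding and several other RFC2047 details:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodedWordSketch {
    // Matches the RFC 2047 encoded-word form =?charset?Q?encoded-text?=
    // (only the Q encoding is handled here).
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?[qQ]\\?([^?]*)\\?=");

    static String decodeIfEncodedWord(String text) {
        Matcher m = ENCODED_WORD.matcher(text);
        if (!m.matches()) {
            return text;                           // not an encoded-word: leave it alone
        }
        Charset charset = Charset.forName(m.group(1));
        String encoded = m.group(2);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '=' && i + 2 < encoded.length()) {
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;                            // "=XX" hex escape
            } else if (c == '_') {
                bytes.write(' ');                  // '_' stands for space in Q encoding
            } else {
                bytes.write((byte) c);             // plain printable ASCII
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // 0xA4 is the euro sign in iso-8859-15 (it would be the currency sign ¤ in iso-8859-1).
        System.out.println(decodeIfEncodedWord("=?iso-8859-15?q?T=A4ST?="));  // prints T€ST
    }
}
```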
Edit: I just realized this answer looks at the issue more from the server-side perspective.
If you are interested in what browsers do when you enter non-ascii characters at the login prompt, I just tried with Firefox.
It seems to lazily convert everything to ISO-8859-1 by taking the least significant byte of each Unicode value. For example, a username and password whose code points end in the bytes 0x5a and 0x4e are encoded the same as the plain ASCII credentials "Z" and "N":
0x5a 0x3a 0x4e base64-> WjpO
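A small sketch of that lossy conversion (the example characters are my own: Ś is U+015A, whose low byte is 0x5A = 'Z', and Ŏ is U+014E, whose low byte is 0x4E = 'N'):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class LowByteDemo {
    public static void main(String[] args) {
        // Illustrative only: mimic the "keep only the least significant byte" behaviour described above.
        String user = "Ś";   // U+015A, low byte 0x5A = 'Z'
        String pass = "Ŏ";   // U+014E, low byte 0x4E = 'N'
        StringBuilder lossy = new StringBuilder();
        for (char c : (user + ":" + pass).toCharArray()) {
            lossy.append((char) (c & 0xFF));       // drop the high byte of each char
        }
        byte[] octets = lossy.toString().getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Base64.getEncoder().encodeToString(octets));  // WjpO, same as plain "Z:N"
    }
}
```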