From this excellent "UTF-8 all the way through" question, I read about this:
Unfortunately, you should verify every submitted string as being valid
UTF-8 before you try to store it or use it anywhere. PHP's
mb_check_encoding() does the trick, but you have to use it
religiously. There's really no way around this, as malicious clients
can submit data in whatever encoding they want, and I haven't found a
trick to get PHP to do this for you reliably.
Now, I'm still learning the quirks of encoding, and I'd like to know exactly what malicious clients can do to abuse encoding. What can one achieve? Can somebody give an example? Let's say I save the user input into a MySQL database, or I send it through e-mail, how can a user create harm if I do not use the mb_check_encoding
functionality?
how can a user create harm if I do not use the mb_check_encoding functionality?
This is about overlong encodings.
Due to an unfortunate quirk of UTF-8 design, it is possible to make byte sequences that, if parsed with a naïve bit-packing decoder, would result in the same character as a shorter sequence of bytes - including a single ASCII character.
For example the character <
is usually represented as byte 0x3C, but could also be represented using the overlong UTF-8 sequence 0xC0 0xBC (or even more redundant 3- or 4-byte sequences).
If you take this input and handle it in a Unicode-oblivious byte-based tool, then any character processing step being used in that tool may be evaded. The canonical example would be submitting 0x80 0xBC to PHP, which has native byte strings. The typical use of htmlspecialchars
to HTML-encode the character <
would fail here because the expected byte sequence 0x3C is not present. So the output of the script would still include the overlong-encoded <
, and any browser reading that output could potentially read the sequence 0x80 0xBC 0x73 0x63 0x72 0x69 0x70 0x74 as <script
and hey presto! XSS.
Overlongs have been banned since way back and modern browsers no longer permit them. But this was a genuine problem for IE and Opera for a long time, and there's no guarantee every browser is going to get it right in future. And of course this is only one example - any place where a byte-oriented tool processes Unicode strings you've potentially got similar problems. The best approach, therefore, is to remove all overlongs at the earliest input phase.
Seems like this is a complicated attack. Checking the docs for mb_check_encoding
gives note to a "Invalid Encoding Attack". Googling "Invalid Encoding Attack" brings up some interesting results that I will attempt to explain.
When this kind of data is sent to the server it will perform some decoding to interpret the characters being sent over. Now the server will do some security checks to look for the encoded version of some special characters that could be potentially harmful.
When invalid encoding is sent to the server, the server still runs its decoding algorithm and it will evaluate the invalid encoding. This is where the trouble happens because the security checks may not be looking for invalid variants that would still produce harmful characters when run through the decoding algorithm.
Example of an attack requesting a full directory listing on a unix system :
http://host/cgi-bin/bad.cgi?foo=..%c0%9v../bin/ls%20-al|
Here are some links if you would like a more detailed technical explanation of what is going on in the algorithms:
http://www.cgisecurity.com/owasp/html/ch11s03.html#id2862815
http://www.cgisecurity.com/fingerprinting-port-80-attacks-a-look-into-web-server-and-web-application-attack-signatures.html