An application I'm maintaining loads user agents extracted from web logs into a MySQL table column using the 'latin1' charset. Occasionally, it fails to load a user agent that looks like this:
Mozilla/5.0 (Iâ?; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML^C like Gecko) Version
I suspect it's choking on Iâ?
. I'm working to figure out if this should be supported, or if it's corruption introduced by the upstream logging system. Is this a legal user agent in a HTTP header?
RFC 2616 (HTTP 1.1) says that message header contents must be "consisting of either *TEXT
or combinations of token, separators, and quoted-string". If you look at the definitions for TEXT etc you will find that legal characters are those with byte values not in the [0, 31] range and not equal to 127; therefore characters such as â
are as far as I can tell legal as per the spec.
Technically, octets > 127 are allowed in comments. RFC 2616 makes them default to ISO-8859-1, but HTTPbis (the upcoming revision of RFC 2616) has removed that rule so that sometimes in the distant future, we may be able to move to a sane encoding.
Recommendation: strip all octets > 127.
HTTP 1.1 RFC2616 refers to ISO-8859-1, which is a latin based single byte character set.
With the consideration that HTTP traffic is supposed to be single byte, I also am using the latin1 character set for my similar logs. The decision was simply to make my indexes smaller.
If you use UTF8 with VARCHAR, only the characters that are multi-byte require additional bytes, so in table space, it's not much extra. However, indexes are stored fixed-width, so, they're padded with spaces just in case you need them (UTF8 indexes are three times as large as latin1 indexes).
It doesn't affect me if the occasional odd header is unreadable. However, if you're not indexing the column, you may as well use UTF8.