I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.
Some interesting examples:
The heart character. If I type this into my browser:
http://www.google.com/search?q=♥
and then copy and paste it back out, I see this URL:
http://www.google.com/search?q=%E2%99%A5
which makes it seem like Firefox (or Safari) is doing this.
urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'
which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.
…
If I type the URL
http://www.google.com/search?q=…
into my browser then copy and paste, I get
http://www.google.com/search?q=%E2%80%A6
back, which seems to be the result of doing
urllib.quote_plus(x.encode("utf-8"))
which makes sense since … can't be encoded with Latin-1.
But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1, since this seems to be ambiguous:
In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'
works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.
What's the right thing to be doing with the special characters I need to deal with?
The first question is: what are your needs? UTF-8 is a pretty good compromise between accepting text created with a cheap editor and supporting a wide variety of languages. As for the browser identifying the encoding, the response from the web server should tell the browser the encoding. Still, most browsers will attempt to guess, because this information is missing or wrong in so many cases. They guess by reading some amount of the response stream to see whether there is a character that does not fit the default encoding. Currently all browsers (I did not check this, but it is pretty close to true) use UTF-8 as the default.
So use UTF-8 unless you have a compelling reason to use one of the many other encoding schemes.
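To make the "the response should tell the browser the encoding" point concrete, here is a minimal Python 3 sketch that reads the charset out of a Content-Type header value and falls back to UTF-8 when it is missing. The helper name `charset_from_content_type` and the sample header values are assumptions made for illustration; real browsers apply more elaborate sniffing than this.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset declared in a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset
    parameter is present, mirroring the "use UTF-8 unless told
    otherwise" advice above.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the declared charset name.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The first call reports the declared charset; the second has nothing declared and so returns the default.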
IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but it is not universally implemented. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before being percent-encoded.
IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.
Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
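The IRI-to-URI step described above (encode as UTF-8, then percent-encode) can be sketched roughly as follows in Python 3. This is a simplified illustration, not a full RFC 3987 implementation: the function name `iri_to_uri` is made up here, and the sketch assumes the host part is already ASCII, so it leaves the authority untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left as-is: the RFC 3986 reserved characters plus "%",
# so that escapes already present in the input are not encoded twice.
_SAFE = "%/:?#[]@!$&'()*+,;="


def iri_to_uri(iri):
    """Percent-encode the non-ASCII parts of an IRI as UTF-8 octets.

    Simplified sketch: only path, query and fragment are handled;
    the host is assumed to be ASCII already.
    """
    parts = urlsplit(iri)
    # quote() encodes str input as UTF-8 by default.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(iri_to_uri("http://www.google.com/search?q=♥"))
print(iri_to_uri("http://www.google.com/search?q=…"))
```

Run against the two URLs from the question, this gives the same percent-encoded forms the browser produced.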
The general rule seems to be that browsers encode form responses according to the Content-Type of the page the form was served from. This is a guess that if the server sends us "text/xml; charset=iso-8859-1", then it expects responses back in the same encoding.
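As a rough illustration of that rule, the following Python 3 sketch encodes the same form field under two different page charsets. The helper name `encode_form` and the sample value "é" are assumptions chosen for the example (the thread itself only uses ♥ and …, neither of which fits in ISO-8859-1).

```python
from urllib.parse import urlencode


def encode_form(fields, page_charset):
    """Encode form fields the way a browser might for a page served
    with the given charset (hypothetical helper for illustration)."""
    return urlencode(fields, encoding=page_charset)


# Same character, different octets on the wire.
print(encode_form({"q": "é"}, "iso-8859-1"))  # q=%E9
print(encode_form({"q": "é"}, "utf-8"))       # q=%C3%A9
```

The server sees only the octets, so it has to know (or assume) which of the two charsets the page was served in to read the value back correctly.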
If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).
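The "three-octet" observation can be checked directly. The sketch below uses Python 3's `urllib.parse.quote_plus` (the counterpart of the `urllib.quote_plus` call in the question) and also tests whether each character can be represented in Latin-1 at all. The helper name `describe` is made up for the example.

```python
from urllib.parse import quote_plus


def describe(ch):
    """Return (UTF-8 octet count, fits in Latin-1?, UTF-8 percent-encoding)."""
    utf8_octets = len(ch.encode("utf-8"))
    try:
        ch.encode("latin-1")
        fits_latin1 = True
    except UnicodeEncodeError:
        # Latin-1 only covers code points U+0000 to U+00FF.
        fits_latin1 = False
    return utf8_octets, fits_latin1, quote_plus(ch, encoding="utf-8")


for ch in ("♥", "…"):
    print(ch, describe(ch))
```

Both characters come out as three UTF-8 octets and neither can be encoded in Latin-1, which is consistent with the browser using UTF-8 for both URLs.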
The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.
It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded in. For instance, in Tomcat, you have to call request.setCharacterEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks.)
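The same "tell the framework which charset to expect" idea can be sketched in Python 3 with the standard library's `parse_qs`, which takes an explicit `encoding` argument. This is an analogy to the Tomcat situation, not Tomcat's API; the helper name `read_param` is invented for the example.

```python
from urllib.parse import parse_qs

# The query string the browser produced for the "…" character.
QUERY = "q=%E2%80%A6"


def read_param(query, name, charset):
    """Decode one query parameter, interpreting its octets in `charset`."""
    return parse_qs(query, encoding=charset)[name][0]


# Identical octets, two different readings.
print(repr(read_param(QUERY, "q", "utf-8")))
print(repr(read_param(QUERY, "q", "latin-1")))
```

Decoding as UTF-8 recovers the single character that was typed, while decoding as Latin-1 succeeds too but yields three unrelated characters, which is why the charset has to be agreed on rather than detected.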
I would always encode in UTF-8. From the Wikipedia page on percent encoding:
It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
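One way a decoder might "attempt several methods" is to try strict UTF-8 first and fall back to a legacy charset only when the octets are not valid UTF-8. The Python 3 sketch below shows that idea; the function name `decode_component` and the choice of Latin-1 as the fallback are assumptions, and actual browser and server heuristics differ.

```python
from urllib.parse import unquote_to_bytes


def decode_component(value, fallback="latin-1"):
    """Percent-decode a query value, preferring UTF-8.

    Hypothetical heuristic: strict UTF-8 first, then `fallback`
    if the octets are not well-formed UTF-8.
    """
    # "+" means space in form-encoded query values.
    raw = unquote_to_bytes(value.replace("+", " "))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


print(decode_component("%E2%99%A5"))  # valid UTF-8
print(decode_component("%E9"))        # not valid UTF-8, read as Latin-1
```

This works reasonably well in practice because text in a legacy single-byte encoding rarely happens to be well-formed UTF-8, but it is still a guess, which is another reason to emit UTF-8 yourself.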