Suppose I have:
<a href="http://www.yahoo.com/" target="_yahoo"
title="Yahoo!™" onclick="return gateway(this);">Yahoo!</a>
<script type="text/javascript">
function gateway(lnk) {
    window.open(SERVLET +
        '?external_link=' + encodeURIComponent(lnk.href) +
        '&external_target=' + encodeURIComponent(lnk.target) +
        '&external_title=' + encodeURIComponent(lnk.title));
    return false;
}
</script>
I have confirmed that external_title gets encoded as Yahoo!%E2%84%A2 and passed to SERVLET. If in SERVLET I do:
Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));
I get Yahoo!â„¢ in the browser. If I manually switch the browser character encoding to UTF-8, it changes to Yahoo!™ (which is what I want).
So I figured the encoding I was sending to the browser was wrong (it was Content-type: text/html; charset=ISO-8859-1). I changed SERVLET to:
response.setContentType("text/html; charset=utf-8");
Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));
Now the browser character encoding is UTF-8, but it outputs Yahoo!⢠and I can't get the browser to render the correct character at all.
My question is: is there some combination of Content-type and/or new String(request.getParameter("external_title").getBytes(), "UTF-8"); and/or something else that will result in Yahoo!™ appearing in the SERVLET output?
You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.
The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.
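To see the mutilation concretely, here is a minimal standalone sketch that percent-decodes the same value with both charsets. It does not use the Servlet API; URLDecoder just stands in for what a container does when it parses the query string, the class name QueryDecodingDemo is made up for illustration, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QueryDecodingDemo {
    // Percent-decodes a value and interprets the resulting bytes in the given charset.
    static String decode(String raw, Charset charset) {
        return URLDecoder.decode(raw, charset); // Charset overload: Java 10+
    }

    public static void main(String[] args) {
        // What encodeURIComponent produces for "Yahoo!" followed by U+2122.
        String raw = "Yahoo!%E2%84%A2";

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        // ISO-8859-1 turns the three bytes into three separate chars
        // (U+00E2, U+0084, U+00A2), which is the mojibake from the question.
        System.out.println(asLatin1.length() - "Yahoo!".length()); // 3

        // UTF-8 turns the same three bytes into the single char U+2122.
        System.out.println(asUtf8.equals("Yahoo!\u2122")); // true
    }
}
```

The bytes on the wire are identical in both cases; only the charset used to interpret them differs.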
(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)
So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see http://wiki.apache.org/tomcat/FAQ/CharacterEncoding and also What's the difference between "URIEncoding" of Tomcat, Encoding Filter and request.setCharacterEncoding.
There is a bug in certain versions of Jetty that makes it parse UTF-8 characters with higher code points incorrectly. If your server accepts Arabic letters correctly but not emoji, that's a sign you have a version with this problem, since Arabic is not in ISO-8859-1 but is in the lower range of Unicode code points ("lower" meaning Java will represent each one in a single char).
I updated from version 7.2.0.v20101020 to version 7.5.4.v20111024 and this fixed the problem; I can now use the getParameter(String) method instead of having to parse the query string myself.
If you're really curious, you can dig into your version of org.eclipse.jetty.util.Utf8StringBuilder.append(byte) and see whether it correctly adds multiple chars to the string when the UTF-8 code point is high enough or if, as in 7.2.0, it simply casts an int to a char and appends.
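The difference between the two behaviours can be illustrated with plain JDK calls. This is only a sketch of the general pitfall, not Jetty's actual code; the class and method names are invented for the example.

```java
final class CodePointAppendDemo {
    // Mimics the faulty approach: narrowing a code point to a single char
    // silently drops everything above the low 16 bits.
    static String appendByCast(int codePoint) {
        return new StringBuilder().append((char) codePoint).toString();
    }

    // Appends one char for code points up to U+FFFF and a surrogate pair above that.
    static String appendCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        int alef = 0x0627;      // ARABIC LETTER ALEF, fits in one Java char
        int grinning = 0x1F600; // an emoji, needs two Java chars

        // Both approaches agree for the Arabic letter...
        System.out.println(appendByCast(alef).equals(appendCodePoint(alef))); // true
        // ...but the cast corrupts the emoji.
        System.out.println(appendByCast(grinning).equals(appendCodePoint(grinning))); // false
        System.out.println(appendCodePoint(grinning).length()); // 2
    }
}
```

This matches the symptom described above: Arabic text survives either way, while emoji only survive when the code point is appended as a surrogate pair.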
You could always use JavaScript to manipulate the text further.
I got the same problem and solved it by decoding request.getQueryString() using URLDecoder, and afterwards extracting my parameters.
I suspect that the data mutilation happens in the request, i.e. the declared encoding of the request does not match the one that is actually used for the data.
What does request.getCharacterEncoding() return?
I don't really know how JavaScript handles encodings or how to make it use a specific one.
You need to make sure that encodings are used correctly at all stages. Do NOT try to "fix" the data by using new String() and getBytes() at a point where it has already been decoded incorrectly.
Edit: It may help to have the origin page (the one with the JavaScript) also encoded in UTF-8 and declared as such in its Content-Type. Then I believe JavaScript may default to using UTF-8 for its request, but this is not definite knowledge, just guesswork.
There is a way to do it in Java (no fiddling with server.xml).
Does not work:
Works:
Worked, but will break if the default encoding != UTF-8. Try this instead (omit the call to decode(); it's not needed):
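A minimal sketch of this kind of re-decoding, not necessarily the exact code the answer had in mind, could look like the following. It assumes the container decoded the query string as ISO-8859-1, and it names both charsets explicitly so the result does not depend on the platform default. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // Undoes an ISO-8859-1 misdecoding: maps each char back to its original byte,
    // then reads those bytes as UTF-8. Only valid if the value really was
    // decoded as ISO-8859-1 in the first place.
    static String recover(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What getParameter() would return if the container decoded
        // %E2%84%A2 as ISO-8859-1 (assumption for this example).
        String fromContainer = "Yahoo!\u00E2\u0084\u00A2";

        System.out.println(recover(fromContainer).equals("Yahoo!\u2122")); // true
    }
}
```

The round trip works because ISO-8859-1 maps every byte to exactly one char and back, so no information is lost by the wrong decoding.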
As I said above, if server.xml is messed with, as in:
(notice the URIEncoding="UTF-8") the code above will break (because the getBytes("iso-8859-1") should read getBytes("UTF-8")). So for a bulletproof solution you have to get the value of the URIEncoding attribute. This unfortunately seems to be container specific and, even worse, container-version specific. For Tomcat 7 you'd need something like:
And you still need to tweak this for multiple connectors (check the commented-out parts). Then you would use something like:
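Leaving aside how the URIEncoding value is obtained (that part is container specific and not shown here), the re-decoding step itself might be sketched as below. The containerUriCharset parameter is a hypothetical input standing for whatever charset the container used on the query string; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UriCharsetRedecode {
    // Re-encodes the parameter with the charset the container used for the
    // query string, then decodes the recovered bytes as UTF-8.
    static String redecode(String param, Charset containerUriCharset) {
        return new String(param.getBytes(containerUriCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String latin1Decoded = "Yahoo!\u00E2\u0084\u00A2"; // container used ISO-8859-1
        String utf8Decoded = "Yahoo!\u2122";               // container used URIEncoding="UTF-8"

        // Matching charset: the original text comes back in both setups.
        System.out.println(redecode(latin1Decoded, StandardCharsets.ISO_8859_1).equals("Yahoo!\u2122")); // true
        System.out.println(redecode(utf8Decoded, StandardCharsets.UTF_8).equals("Yahoo!\u2122"));        // true

        // Mismatch: U+2122 has no ISO-8859-1 byte, so it is replaced by '?'.
        System.out.println(redecode(utf8Decoded, StandardCharsets.ISO_8859_1)); // Yahoo!?
    }
}
```

The last call shows why hard-coding "iso-8859-1" breaks once the connector is configured with URIEncoding="UTF-8".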
Still, this may fail (IIUC) if parameter = request.getParameter("name"); decoded with CHARSET_FOR_URI_ENCODING was corrupted, so that the bytes I get with getBytes() were not the original ones (that's why "iso-8859-1" is used by default: it will preserve the bytes). You can get rid of it all by manually parsing the query string along the lines of:
I am still looking for the place in the docs where it is mentioned that request.getParameter("name") does call URLDecoder.decode() instead of returning the %CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7 string. A link in the source would be much appreciated.
Also, how can I pass as the parameter's value the string, say, %CE? => see comment: parameter=%25CE
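For illustration, manual parsing of the raw query string might look roughly like this. It is a simplified sketch: in a servlet the input would come from request.getQueryString(), here it is a plain string so the example runs on its own. The class name is hypothetical, repeated parameter names simply keep the last value, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryStringParser {
    // Splits a raw (still percent-encoded) query string on '&' and '=' and
    // decodes each name and value as UTF-8. Note that URLDecoder also turns
    // '+' into a space, as HTML form encoding does.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray '&'
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("external_title=Yahoo!%E2%84%A2&parameter=%25CE");

        System.out.println(params.get("external_title").equals("Yahoo!\u2122")); // true
        // %25 decodes to a literal '%', so the value is the three characters %CE.
        System.out.println(params.get("parameter")); // %CE
    }
}
```

Because the decoding charset is chosen in application code, this approach does not depend on how the container is configured.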