Usage scenario
We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website, the user enters something (e.g. a query string). Internally, the website calls the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should do), so it may call the service via the API with a query string that contains German umlauts.
I.e. for a query part looking like
if "ISO-8859-1" is selected, the sent query part looks like
but if "UTF-8" is selected, the sent query part looks like
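To make the difference concrete, here is a minimal sketch that percent-encodes a single umlaut under both charsets. The class name QueryEncodingDemo and the sample character "ä" are assumptions chosen for illustration; java.net.URLEncoder only approximates what a browser does with a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryEncodingDemo {
    // Percent-encodes a value using the given charset name.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String umlaut = "\u00E4"; // "ä", an illustrative sample value
        System.out.println(encode(umlaut, "ISO-8859-1")); // %E4
        System.out.println(encode(umlaut, "UTF-8"));      // %C3%A4
    }
}
```

The same character becomes one byte in ISO-8859-1 and two bytes in UTF-8, and the lone byte E4 is not a valid UTF-8 sequence on its own.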
Desired Solution
Because we implemented the service ourselves and therefore control it, we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
Check for each character ( == string.substring(i,i+1) )
- if character.getBytes()[0] equals 63 for '?'
- if Character.getType(character.charAt(0)) returns OTHER_SYMBOL
protected List< String > getNonUnicodeCharacters( String s ) {
    final List< String > result = new ArrayList< String >();
    for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
        final String character = s.substring( i , i + 1 );
        final boolean isOtherSymbol =
            ( int ) Character.OTHER_SYMBOL
            == Character.getType( character.charAt( 0 ) );
        // getBytes() uses the platform default charset; 63 is '?',
        // which is typically substituted for unmappable characters
        final boolean isNonUnicode = isOtherSymbol
            && character.getBytes()[ 0 ] == ( byte ) 63;
        if ( isNonUnicode )
            result.add( character );
    }
    return result;
}
Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( URLDecoder.decode( test[ i ] , "UTF-8" ) );
    System.out.println( URLDecoder.decode( test[ i ] , "ISO-8859-1" ) );
}
This prints:
and it does not throw an IllegalArgumentException. Sigh.
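A minimal sketch of what happens instead, assuming a standard JDK: URLDecoder decodes the percent-escaped bytes through the String constructor, which silently substitutes U+FFFD (the replacement character) for malformed input. The class name UrlDecoderDemo and the helper names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UrlDecoderDemo {
    // Decodes a percent-encoded string as UTF-8.
    static String decodeAsUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // True if decoding had to substitute the replacement character U+FFFD.
    static boolean hasReplacementChar(String decoded) {
        return decoded.indexOf('\uFFFD') >= 0;
    }
}
```

Checking the decoded string for U+FFFD is a rough heuristic only: a client could legitimately send that character, properly encoded, and it would be flagged as well.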
Use UTF-8 as the default everywhere you can: database, memory, and UI.
A single charset encoding avoids a lot of problems and can even improve your web server's performance, because a lot of processing power and memory is otherwise wasted on encoding and decoding.
Replace all control characters with an empty string.
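One way to do this, as a sketch rather than the original answer's code, is the regex class \p{Cntrl}, which in java.util.regex matches the ASCII control characters (0x00 to 0x1F and 0x7F) by default. The class and method names are hypothetical.

```java
final class ControlChars {
    // Removes ASCII control characters (including tab, CR and LF) from the input.
    static String stripControlChars(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }
}
```

This only cleans up the input. It does not tell you whether the request was sent in the wrong encoding.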
URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:
So you should probably try it. Note also (from the decode() method documentation):
so there's something else to think about!
EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.
You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
See CodingErrorAction.REPORT
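A minimal sketch of that approach, assuming the service can get at the raw (already percent-decoded) bytes of the query value and runs on Java 7 or later for StandardCharsets. The class name StrictUtf8 and the method name isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If isValidUtf8 returns false, the service could answer with a 4xx status. Note that pure ASCII input is valid in both encodings and passes this check.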
This is what I used to check the encoding:
Edit: updated with Vouze's suggestion
You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
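A sketch of how the server side could evaluate such a probe, assuming it sees the raw, still percent-encoded query string and that the parameter is called encTest as in the example above. The class name EncodingProbe and the method name looksLikeUtf8 are hypothetical.

```java
final class EncodingProbe {
    // "ä€" as percent-encoded UTF-8 bytes: C3 A4 for "ä", E2 82 AC for "€".
    static final String UTF8_PROBE = "%C3%A4%E2%82%AC";

    // True only if the raw query carries the probe parameter encoded as UTF-8.
    static boolean looksLikeUtf8(String rawQuery) {
        for (String pair : rawQuery.split("&")) {
            if (pair.startsWith("encTest=")) {
                String value = pair.substring("encTest=".length());
                return value.equalsIgnoreCase(UTF8_PROBE);
            }
        }
        return false; // probe missing: treat the encoding as unknown
    }
}
```

The probe works because "€" has no ISO-8859-1 representation, so a client using that encoding cannot produce the same byte sequence.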