How to determine if a String contains invalid enco

Posted 2019-01-10 11:35

Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (e.g. a query string). Internally, the website calls the service via the API.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes URLs accordingly. This results in different query strings for ISO-8859-1 and UTF-8.

Our website forwards the user's input without converting it (which it should do), so it may call the web service via the API with a query string that contains German umlauts.

For example, for a query part looking like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def
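
The two forms above can be reproduced with `java.net.URLEncoder`. The following is only an illustrative sketch; the class name `EncodingDemo` and its helper are made up for this example and are not part of the service.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class: shows how the chosen charset changes the percent-encoding.
final class EncodingDemo {

    // Percent-encodes the value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "abc", a-umlaut (U+00E4), "def"; escaped to avoid source-encoding issues
        String value = "abc\u00e4def";
        System.out.println(encode(value, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(value, "UTF-8"));      // abc%C3%A4def
    }
}
```

The single byte `E4` is the ISO-8859-1 code for the umlaut, while `C3 A4` is its two-byte UTF-8 sequence.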

Desired Solution

As we control the service (we implemented it), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

Check for each character ( == string.substring(i,i+1) )

  1. if character.getBytes()[0] equals 63 for '?'
  2. if Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

protected List< String > getNonUnicodeCharacters( String s ) {
  final List< String > result = new ArrayList< String >();
  for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
    final String character = s.substring( i , i + 1 );
    final boolean isOtherSymbol = 
      ( int ) Character.OTHER_SYMBOL
       == Character.getType( character.charAt( 0 ) );
    final boolean isNonUnicode = isOtherSymbol 
      && character.getBytes()[ 0 ] == ( byte ) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}

Question

Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcädef

and it does not throw an IllegalArgumentException (sigh).

10 answers
我想做一个坏孩纸
#2 · 2019-01-10 11:41

Try to use UTF-8 as the default everywhere you can (database, memory, and UI).

Using a single charset encoding avoids a lot of problems, and it can actually improve your web server's performance: a lot of processing power and memory is wasted on encoding and decoding.
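
One concrete way to follow this advice in Java is to name the charset explicitly instead of relying on the platform default (which is what a bare `getBytes()` uses). The sketch below is an illustration only; `ExplicitCharsetDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: always pass the charset explicitly.
final class ExplicitCharsetDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Encodes to UTF-8 and decodes again with the same charset.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // a-umlaut
        System.out.println(utf8Length(umlaut));   // 2
        System.out.println(latin1Length(umlaut)); // 1
        System.out.println(roundTripUtf8(umlaut).equals(umlaut)); // true
    }
}
```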

Anthone
#3 · 2019-01-10 11:43

Replace all control characters with an empty string:

value = value.replaceAll("\\p{Cntrl}", "");
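
A small sketch of what this call does and does not remove. In Java's regex syntax, `\p{Cntrl}` matches the ASCII control characters (`\x00`-`\x1F` and `\x7F`); the replacement character U+FFFD that a lenient decoder produces for invalid bytes is not among them, so it would survive. `ControlCharDemo` is a hypothetical name used only here.

```java
// Hypothetical demo class: strips ASCII control characters from a string.
final class ControlCharDemo {

    // Removes every character matched by \p{Cntrl}.
    static String stripControlChars(String value) {
        return value.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // NUL and TAB are control characters and are removed.
        System.out.println(stripControlChars("a\u0000b\tc").equals("abc")); // true
        // U+FFFD is not a control character, so the string is unchanged.
        String withReplacement = "abc\uFFFDdef";
        System.out.println(stripControlChars(withReplacement).equals(withReplacement)); // true
    }
}
```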
甜甜的少女心
#4 · 2019-01-10 11:48

URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:

There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.

So you should probably try it. Note also (from the decode() method documentation):

The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

so there's something else to think about!
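
Trying it out, as suggested above: on the OpenJDK implementations I am aware of, `java.net.URLDecoder` does not throw for bytes that are invalid in the target charset but substitutes the replacement character U+FFFD. The sketch below relies on that observed behavior, which the documentation does not guarantee; `LenientDecodeDemo` and `looksMisencoded` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: detects invalid UTF-8 via the replacement character.
final class LenientDecodeDemo {

    static final char REPLACEMENT = '\uFFFD';

    // Decodes a percent-encoded string as UTF-8 using the JDK's lenient decoder.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }

    // Heuristic: true if decoding produced U+FFFD somewhere in the result.
    static boolean looksMisencoded(String encoded) {
        return decodeUtf8(encoded).indexOf(REPLACEMENT) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(looksMisencoded("v=abc%E4def"));    // true
        System.out.println(looksMisencoded("v=abc%C3%A4def")); // false
    }
}
```

This is a heuristic only: a client that deliberately sends a correctly encoded U+FFFD (`%EF%BF%BD`) would be flagged as well.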

EDIT: The URLCodec class in Apache Commons Codec claims to throw appropriate exceptions for bad encodings.

女痞
#5 · 2019-01-10 11:55

You can use a CharsetDecoder configured to throw an exception if invalid chars are found:

 CharsetDecoder UTF8Decoder =
      Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT
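
A fuller sketch of how such a decoder could be applied to a percent-encoded query part: first turn the escapes into raw bytes, then decode those bytes strictly. The class and method names (`StrictUtf8Query`, `percentDecodeToBytes`, `decodeStrict`, `isValidUtf8`) are hypothetical, and treating `+` as a space is an assumption about the form encoding in use.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of a percent-encoded string.
final class StrictUtf8Query {

    // Value of an ASCII hex digit, or -1 if the character is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Turns %XX escapes into raw bytes; '+' becomes a space (assumption);
    // other ASCII characters are copied, unescaped non-ASCII is rejected.
    static byte[] percentDecodeToBytes(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Incomplete escape at index " + i);
                }
                int hi = hexValue(encoded.charAt(i + 1));
                int lo = hexValue(encoded.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits
            } else if (c == '+') {
                out.write(' ');
            } else if (c < 0x80) {
                out.write(c);
            } else {
                throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Decodes as UTF-8 and throws instead of substituting U+FFFD.
    static String decodeStrict(String encoded) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(percentDecodeToBytes(encoded))).toString();
    }

    // True if the string is well-formed percent-encoding of valid UTF-8.
    static boolean isValidUtf8(String encoded) {
        try {
            decodeStrict(encoded);
            return true;
        } catch (CharacterCodingException | IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("v=abc%C3%A4def")); // true
        System.out.println(isValidUtf8("v=abc%E4def"));    // false
    }
}
```

A service could call `isValidUtf8` on the raw query value before any lenient decoding happens and answer with a 4xx status when it returns false.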

手持菜刀,她持情操
#6 · 2019-01-10 11:56

This is what I used to check the encoding:

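import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Report malformed or unmappable input instead of silently replacing it.
CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
// endOfInput = true: bytes holds the complete input.
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
// isError() covers malformed and unmappable input; overflow means out was too small.
// Underflow is not an error here: it means all input was consumed.
if (result.isError() || result.isOverflow())
{
    System.out.println("Cannot decode EBCDIC");
}
else
{
    // Flush any output the decoder is still holding back.
    CoderResult flushResult = ebcdicDecoder.flush(out);
    if (flushResult.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (flushResult.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}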

Edit: updated with Vouze's suggestion

家丑人穷心不美
#7 · 2019-01-10 11:56

You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
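
A minimal sketch of how the server side of this idea might look, assuming the parameter is called `encTest` as in the example above and that the raw (still percent-encoded) query string is available. What a non-UTF-8 client sends for the euro sign varies, so the sketch only recognizes the UTF-8 form; `EncodingSentinel` and its methods are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper: checks a known sentinel parameter to detect UTF-8 requests.
final class EncodingSentinel {

    // UTF-8 percent-encoding of a-umlaut (U+00E4) followed by the euro sign (U+20AC).
    static final String UTF8_FORM = "%C3%A4%E2%82%AC";

    // Returns the raw, still percent-encoded value of the named parameter, or null.
    static String rawParam(String rawQuery, String name) {
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    // True if the sentinel parameter carries the UTF-8 form (hex digits in any case).
    static boolean sentAsUtf8(String rawQuery) {
        String raw = rawParam(rawQuery, "encTest");
        return raw != null && raw.toUpperCase(Locale.ROOT).equals(UTF8_FORM);
    }

    public static void main(String[] args) {
        System.out.println(sentAsUtf8("v=abc%C3%A4def&encTest=%C3%A4%E2%82%AC")); // true
        System.out.println(sentAsUtf8("v=abc%E4def&encTest=%E4%80"));             // false
    }
}
```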
