How to determine if a String contains invalid enco

Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website, the user enters something (i.e. a query string). Internally, the website then calls the service via the API.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes URLs accordingly. This results in different query strings for ISO-8859-1 and UTF-8.

Our website forwards the user's input without converting it (which it should do), so it may call the web service via the API with a query string that contains German umlauts.

For example, for a query part looking like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def

Desired Solution

As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

For each character (== string.substring(i, i + 1)), check both of the following:

whether character.getBytes()[0] equals 63, i.e. '?'
whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

protected List< String > getNonUnicodeCharacters( String s ) {
  final List< String > result = new ArrayList< String >();
  for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
    final String character = s.substring( i , i + 1 );
    final boolean isOtherSymbol = 
      ( int ) Character.OTHER_SYMBOL
       == Character.getType( character.charAt( 0 ) );
    final boolean isNonUnicode = isOtherSymbol 
      && character.getBytes()[ 0 ] == ( byte ) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}

Question

Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code:

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def

and it does not throw an IllegalArgumentException. Sigh.
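The `?` in the first output line is most likely how the console renders U+FFFD, the Unicode replacement character that `URLDecoder` substitutes for bytes it cannot interpret in the requested charset. A minimal sketch of a check built on that assumption follows; the class and method names are hypothetical, and a request that legitimately contains U+FFFD (sent as `%EF%BF%BD`) would be flagged as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ReplacementCharCheck {

    // U+FFFD is what the decoder emits in place of undecodable bytes.
    private static final char REPLACEMENT = '\uFFFD';

    // Hypothetical helper: true if decoding as UTF-8 produced a replacement character.
    public static boolean looksLikeInvalidUtf8(String rawQuery) {
        try {
            String decoded = URLDecoder.decode(rawQuery, "UTF-8");
            return decoded.indexOf(REPLACEMENT) >= 0;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInvalidUtf8("v=abc%E4def"));    // true
        System.out.println(looksLikeInvalidUtf8("v=abc%C3%A4def")); // false
    }
}
```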

Tags: java string unicode encoding

10 answers

我想做一个坏孩纸

#2 · 2019-01-10 11:41

Use UTF-8 as the default everywhere you can (database, memory, and UI).

Sticking to a single charset encoding avoids a lot of problems, and it can even improve your web server's performance: a lot of processing power and memory is wasted on converting between encodings.

Anthone

#3 · 2019-01-10 11:43

Replace all control characters with an empty string:

value = value.replaceAll("\\p{Cntrl}", "");

甜甜的少女心

#4 · 2019-01-10 11:48

URLDecoder decodes using a given encoding, so it should flag errors appropriately. However, the documentation states:

There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.

So you should probably try it. Note also (from the decode() method documentation):

The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

so there's something else to think about!

EDIT: Apache Commons URLDecode claims to throw appropriate exceptions for bad encodings.

女痞

#5 · 2019-01-10 11:55

You can use a CharsetDecoder configured to throw an exception if invalid characters are found:

CharsetDecoder utf8Decoder =
      Charset.forName("UTF-8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT
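To apply such a reporting decoder to the query strings from the question, the percent-escapes first have to be turned back into raw bytes, since a String that has already been decoded no longer carries the original bytes. The following is a sketch under that assumption, not a definitive implementation; `StrictUtf8Check`, `percentDecode` and `isValidUtf8` are hypothetical names, and the hand-written escape parser only covers the simple `%XX` and `+` cases.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {

    // Converts %XX escapes (and '+') into raw bytes without interpreting them as text.
    public static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                out.write(' '); // '+' stands for a space in query strings
            } else if (c <= 0x7F) {
                out.write(c);
            } else {
                throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Returns false as soon as the bytes contain a sequence that is not valid UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(percentDecode("v=abc%E4def")));    // false
        System.out.println(isValidUtf8(percentDecode("v=abc%C3%A4def"))); // true
    }
}
```

A service could respond with a 4xx status whenever `isValidUtf8` returns false or `percentDecode` throws.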

手持菜刀，她持情操

#6 · 2019-01-10 11:56

This is what I used to check the encoding:

import java.nio.*;
import java.nio.charset.*;

CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
// With endOfInput = true, UNDERFLOW means all input was consumed, i.e. success.
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
if (result.isError() || result.isOverflow())
{
    // isError() covers both malformed input and unmappable characters;
    // overflow means the output buffer was too small.
    System.out.println("Cannot decode EBCDIC");
}
else
{
    // Renamed: redeclaring "result" here would not compile.
    CoderResult flushResult = ebcdicDecoder.flush(out);
    if (flushResult.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (flushResult.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}

Edit: updated with Vouze's suggestion

家丑人穷心不美

#7 · 2019-01-10 11:56

You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
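One way this idea might look on the server side is to compare the raw, still percent-encoded value of the probe parameter with the form a UTF-8 client would send. This is only a sketch: `EncodingProbe` and its methods are hypothetical, it assumes the raw value of `encTest` is available before any decoding, and it does not try to predict what a non-UTF-8 client sends for `€`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingProbe {

    // "ä€" written with escapes so the source file's own encoding does not matter.
    private static final String PROBE = "\u00E4\u20AC";

    // The percent-encoded form of the probe when the client used UTF-8.
    public static String expectedUtf8Form() {
        try {
            return URLEncoder.encode(PROBE, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hex digits in escapes may be upper or lower case, so compare case-insensitively.
    public static boolean sentAsUtf8(String rawProbeValue) {
        return expectedUtf8Form().equalsIgnoreCase(rawProbeValue);
    }

    public static void main(String[] args) {
        System.out.println(expectedUtf8Form());             // %C3%A4%E2%82%AC
        System.out.println(sentAsUtf8("%c3%a4%e2%82%ac"));  // true
        System.out.println(sentAsUtf8("%E4%80"));           // false
    }
}
```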
