Java : How to determine the correct charset encodi

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programatically determine the correct charset encoding of an inputstream/file ?

I have tried using the following:

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.

标签： java file encoding stream character-encoding

15条回答

人间绝色

2楼-- · 2018-12-31 04:50

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

0人赞添加讨论(0) 举报

浮光初槿花落

3楼-- · 2018-12-31 04:52

check this out: http://site.icu-project.org/ (icu4j) they have libraries for detecting charset from IOStream could be simple like this:

BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
   reader = cm.getReader();
   charset = cm.getName();
}else {
   throw new UnsupportedCharsetException()
}

0人赞添加讨论(0) 举报

若你有天会懂

4楼-- · 2018-12-31 04:53

I found a nice third party library which can detect actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively but it seems to work.

0人赞添加讨论(0) 举报

看风景的人

5楼-- · 2018-12-31 04:55

I have used this library, similar to jchardet for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

0人赞添加讨论(0) 举报

浮光初槿花落

6楼-- · 2018-12-31 04:59

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }

0人赞添加讨论(0) 举报

何处买醉

7楼-- · 2018-12-31 05:00

The libs above are simple BOM detectors which of course only work if there is a BOM in the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scans the text

0人赞添加讨论(0) 举报

1 2 3 下一页

Java : How to determine the correct charset encodi

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间