Java : How to determine the correct charset encodi

2018-12-31 04:37发布

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programatically determine the correct charset encoding of an inputstream/file ?

I have tried using the following:

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.

15条回答
人间绝色
2楼-- · 2018-12-31 04:50

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

查看更多
浮光初槿花落
3楼-- · 2018-12-31 04:52

check this out: http://site.icu-project.org/ (icu4j) they have libraries for detecting charset from IOStream could be simple like this:

BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
   reader = cm.getReader();
   charset = cm.getName();
}else {
   throw new UnsupportedCharsetException()
}
查看更多
若你有天会懂
4楼-- · 2018-12-31 04:53

I found a nice third party library which can detect actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively but it seems to work.

查看更多
看风景的人
5楼-- · 2018-12-31 04:55

I have used this library, similar to jchardet for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

查看更多
浮光初槿花落
6楼-- · 2018-12-31 04:59

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }
查看更多
何处买醉
7楼-- · 2018-12-31 05:00

The libs above are simple BOM detectors which of course only work if there is a BOM in the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scans the text

查看更多
登录 后发表回答