With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programmatically determine the correct charset encoding of an InputStream/file?
I have tried using the following:
import java.io.*;

File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1, the above code yields ASCII, which is not correct and does not allow me to render the file's content back to the console correctly.
You cannot reliably determine the encoding of an arbitrary byte stream: InputStreamReader.getEncoding() only reports the charset the reader itself is decoding with (the platform default, if you didn't specify one), not the encoding of the underlying data. If you don't know the encoding of your data, it is not easy to determine, but you could try a library that guesses it heuristically. Also, there is a similar question.
Check this out: http://site.icu-project.org/ (icu4j). It provides a CharsetDetector for guessing the charset of an InputStream. Usage can be as simple as this:
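A minimal sketch using ICU4J's CharsetDetector (the file path in args[0] is just for illustration; the detector needs a stream that supports mark/reset, hence the BufferedInputStream):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class DetectCharset {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(in);
            CharsetMatch match = detector.detect(); // best guess; may be null
            if (match != null) {
                System.out.println(match.getName() + " (confidence " + match.getConfidence() + "%)");
            }
        }
    }
}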
I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
I didn't test it extensively, but it seems to work.
I have used this library, similar to jchardet, for detecting encoding in Java: http://code.google.com/p/juniversalchardet/
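A sketch of the usual detection loop with juniversalchardet, assuming the file to inspect is passed as args[0] (the detector is fed chunks until it is confident or the input ends):

import org.mozilla.universalchardet.UniversalDetector;
import java.io.FileInputStream;

public class EncodingGuesser {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream(args[0])) {
            UniversalDetector detector = new UniversalDetector(null);
            int nread;
            // feed the detector until it is confident or the stream ends
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
            detector.dataEnd(); // flush internal state
            String encoding = detector.getDetectedCharset(); // null if nothing detected
            System.out.println(encoding != null ? encoding : "No encoding detected");
            detector.reset();
        }
    }
}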
Here are my favorites:
TikaEncodingDetector
Dependency:
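Assuming the Maven coordinates of the artifact that ships TikaEncodingDetector (the version here is a guess; check the repository for the current one):

<dependency>
    <groupId>org.apache.any23</groupId>
    <artifactId>apache-any23-encoding</artifactId>
    <version>1.1</version> <!-- assumed version; use the latest available -->
</dependency>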
Sample:
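A minimal sketch, assuming org.apache.any23.encoding.TikaEncodingDetector, whose guessEncoding() reads the stream and returns a charset name:

import org.apache.any23.encoding.TikaEncodingDetector;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

public static Charset guessCharset(InputStream is) throws IOException {
    // guessEncoding() returns the detected charset name as a String
    return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
}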
GuessEncoding
Dependency:
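Assuming the coordinates published for Guillaume Laforge's GuessEncoding (again, the version is a guess; check the repository):

<dependency>
    <groupId>org.codehaus.guessencoding</groupId>
    <artifactId>guessencoding</artifactId>
    <version>1.4</version> <!-- assumed version; use the latest available -->
</dependency>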
Sample:
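A minimal sketch, assuming CharsetToolkit's static helper, which falls back to the given default charset when the sampled bytes are inconclusive:

import com.glaforge.i18n.io.CharsetToolkit;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static Charset guessCharset(File file) throws IOException {
    // inspects the first 4096 bytes; defaults to UTF-8 if undecided
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}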
The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which actually scans the text.
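A sketch of jchardet's callback-style API, based on the HtmlCharsetDetector demo that ships with the library (the file path in args[0] is just for illustration):

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;
import java.io.BufferedInputStream;
import java.io.FileInputStream;

public class JchardetSample {
    public static void main(String[] args) throws Exception {
        nsDetector det = new nsDetector(nsPSMDetector.ALL);
        final String[] detected = new String[1];
        // the observer is called back once the detector is confident
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                detected[0] = charset;
            }
        });
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] buf = new byte[1024];
            int len;
            boolean done = false;
            // DoIt() returns true once the detector has made up its mind
            while (!done && (len = in.read(buf)) != -1) {
                done = det.DoIt(buf, len, false);
            }
        }
        det.DataEnd(); // may trigger Notify() for short inputs
        System.out.println(detected[0] != null ? detected[0] : "No charset detected");
    }
}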