With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programatically determine the correct charset encoding of an inputstream/file ?
I have tried using the following:
File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.
In plain Java:
This approach will try the encodings one by one until one works or we run out of them. (BTW my encodings list has only those items because they are the charsets implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)
You cannot determine the encoding of a arbitrary byte stream. This is the nature of encodings. A encoding means a mapping between a byte value and its representation. So every encoding "could" be the right.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a common frequency for every char. In English the char e appears very often but ê will appear very very seldom. In a ISO-8859-1 stream there are usually no 0x00 chars. But a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
Which library to use?
As of this writing, they are three libraries that emerge:
I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.
How to tell which one has detected the right charset (or as close as possible)?
It's impossible to certify the charset detected by each above libraries. However, it's possible to ask them in turn and score the returned response.
How to score the returned response?
Each response can be assigned one point. The more points a response have, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.
Is there any sample code?
Here is a full snippet implementing the strategy described in the previous lines.
Improvements: The
guessEncoding
method reads the inputstream entirely. For large inputstreams this can be a concern. All these libraries would read the whole inputstream. This would imply a large time consumption for detecting the charset.It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.