Java Text File Encoding


Question:

I have a text file, and it can be ANSI (with the ISO-8859-2 charset), UTF-8, or UCS-2 big- or little-endian.

Is there any way to detect the encoding of the file to read it properly?

Or is it possible to read a file without specifying the encoding, so that it is read as-is?

(There are several programs that can detect and convert the encoding/format of text files.)

Answer 1:

UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
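For example, checking for a BOM by hand is straightforward. Here is a minimal sketch (the BomSniffer class name and detectBomCharset method are made up for illustration, not taken from the question):

    import java.io.FileInputStream;
    import java.io.IOException;

    class BomSniffer {
        // Peek at the first bytes of the file and map known byte order marks
        // to a charset name; returns null when no BOM is found.
        static String detectBomCharset(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                byte[] bom = new byte[3];
                int n = in.read(bom);
                if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
                    return "UTF-8";
                }
                if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
                    return "UTF-16BE"; // UCS-2 / UTF-16 big-endian
                }
                if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
                    return "UTF-16LE"; // UCS-2 / UTF-16 little-endian
                }
                return null; // no BOM: could still be UTF-8 or ISO-8859-2
            }
        }
    }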

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
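If there is no BOM, one common heuristic is to attempt a strict decode and see whether it fails. This is just a sketch using plain java.nio, not something from the answer itself; note that for ISO-8859-2 nearly any byte sequence decodes, which is exactly the limitation described above:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    class StrictDecodeCheck {
        // Returns true if the bytes decode without error in the given charset.
        // Useful for UTF-8 or UTF-16; nearly useless for ISO-8859-2.
        static boolean decodesCleanly(byte[] data, String charsetName) {
            try {
                Charset.forName(charsetName)
                       .newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(data));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }
    }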

There's no such thing as reading a file "as it is" and still getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.



Answer 2:

Yes, there are a number of libraries for character-encoding detection in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There are also cpdetector and a project by IBM called ICU4j. I'd take a look at the latter, as it seems to be more reliable than the other two. They work by statistical analysis of the file's bytes, and ICU4j will also give you a confidence level for the encoding it detects, so you can use that in the case above. It works pretty well.
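For instance, ICU4j's CharsetDetector can list every candidate encoding together with its confidence via detectAll(). A short sketch (the class and method names here are illustrative):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    class DetectAllExample {
        // Prints every candidate encoding ICU4j can match, best first,
        // together with its confidence score (0-100).
        static void listCandidates(byte[] data) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);
            for (CharsetMatch match : detector.detectAll()) {
                System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
            }
        }
    }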



Answer 3:

You can use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

            String charset = "ISO-8859-1"; //Default chartset, put whatever you want

            byte[] fileContent = null;
            FileInputStream fin = null;

            //create FileInputStream object
            fin = new FileInputStream(file.getPath());

            /*
             * Create byte array large enough to hold the content of the file.
             * Use File.length to determine size of the file in bytes.
             */
            fileContent = new byte[(int) file.length()];

            /*
             * To read content of the file in byte array, use
             * int read(byte[] byteArray) method of java FileInputStream class.
             *
             */
            fin.read(fileContent);

            byte[] data =  fileContent;

            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);

            CharsetMatch cm = detector.detect();

            if (cm != null) {
                int confidence = cm.getConfidence();
                System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
                //Here you have the encode name and the confidence
                //In my case if the confidence is > 50 I return the encode, else I return the default value
                if (confidence > 50) {
                    charset = cm.getName();
                }
            }

Remember to add all the try/catch handling it needs.
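Once you have the charset name, you might read the file with it like this (a sketch using InputStreamReader, not part of the original answer; file and charset are the variables from the code above):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }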

I hope this works for you.



Answer 4:

If your text file is a properly created Unicode text file, then the byte order mark (BOM) should tell you all the information you need. See here for more details about the BOM.

If it's not, then you'll have to use an encoding-detection library.