How to read a UTF-8 encoded file in Java with Turkish characters

Posted 2020-02-02 01:47

Question:

I am trying to read a UTF-8 encoded txt file that contains some Turkish characters. I have written an Axis-based web service that reads this file and sends the contents back as a string. Somehow I am not able to read the characters properly. The code is very simple:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class TurkishWebService {

    public String generateTurkishString() throws IOException {
        InputStream isr = this.getClass().getResourceAsStream(
                "/" + "turkish.txt");

        BufferedReader in = new BufferedReader(new InputStreamReader(isr,
                "UTF8"));
        // Collect the lines; returning the loop variable would always yield null.
        StringBuilder result = new StringBuilder();
        String str;

        while ((str = in.readLine()) != null) {
            System.out.println(str);
            result.append(str);
        }

        in.close();
        return result.toString();
    }

    public String normalString() {
        System.out.println("webService normal text");
        return "webService normal text";
    }

    public static void main(String args[]) throws IOException {
        new TurkishWebService().generateTurkishString();
    }
}

Here are the contents of turkish.txt, just one line:

Assalğçğıİİööşş

The output I get on stdout is:

Assal?τ????÷÷??

Please suggest what I am doing wrong here.

Answer 1:

You appear to be correctly decoding the file data from UTF-8 to UTF-16 strings.

System.out transcodes UTF-16 strings to the default JRE character encoding. If this does not match the encoding used by the device receiving the character data, the output is corrupted. So the console should be set to the default character encoding; otherwise data corruption occurs. How this is done is device-dependent.
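
One way to take the default encoding out of the picture is to wrap System.out in a PrintStream with an explicit charset. The following is a minimal sketch, assuming the receiving console is actually configured for UTF-8; the class name Utf8Stdout and the helper wrap are hypothetical names chosen for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Stdout {

    // Wraps any byte stream in a PrintStream that encodes characters as UTF-8
    // instead of the JRE default encoding.
    static PrintStream wrap(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = wrap(System.out);
        // Escapes keep this source file ASCII-only; the text is the sample line.
        out.println("Assal\u011F\u00E7\u011F\u0131\u0130\u0130\u00F6\u00F6\u015F\u015F");
    }
}
```

If the console uses a different encoding, this will still print garbage, only different garbage; the encoding named here has to match the device.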

If you are using a terminal, java.io.Console (obtained via System.console()) does a better job of determining the device encoding.
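
A rough sketch of that approach follows. System.console() can return null when no interactive terminal is attached (for example, when output is redirected or when running inside some IDEs), so the sketch falls back to System.out in that case. TerminalOutput and writer are hypothetical names.

```java
import java.io.Console;
import java.io.PrintWriter;

class TerminalOutput {

    // Returns the Console's writer when a terminal is attached,
    // otherwise a PrintWriter over System.out.
    static PrintWriter writer() {
        Console console = System.console();
        if (console != null) {
            return console.writer();
        }
        return new PrintWriter(System.out, true);
    }

    public static void main(String[] args) {
        PrintWriter out = writer();
        out.println("Assal\u011F\u00E7\u011F\u0131\u0130\u0130\u00F6\u00F6\u015F\u015F");
        out.flush();
    }
}
```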

Note: it is better to use try-with-resources, or at least try-finally, to close streams; use the standard encoding constants (java.nio.charset.StandardCharsets) if available.
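
As an illustration of that note, here is a sketch of the read loop using try-with-resources and StandardCharsets.UTF_8 (both available from Java 7; BufferedReader.lines() needs Java 8). Utf8ResourceReader and readAll are hypothetical names, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class Utf8ResourceReader {

    // Decodes the whole stream as UTF-8 and returns its lines joined by '\n'.
    static String readAll(InputStream source) throws IOException {
        // try-with-resources closes the reader, and the stream it wraps,
        // even if reading throws.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}
```

In the web service it could be called as readAll(getClass().getResourceAsStream("/turkish.txt")), assuming the resource is on the classpath.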



Answer 2:

Make sure the console you use to display the output also uses UTF-8. In Eclipse, for example, you set this under Run Configurations > Common.



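Answer 3:

The code looks good. The problem is most likely that the console cannot print Turkish characters. To be sure, add a temporary test to your program: take the string that you read from the file (the one displayed as Assal?τ????÷÷??) and do this:

 System.out.println(str.charAt(5) == '\u011F'); // index 5 should hold 'ğ' (U+011F)

If this prints true, the file was decoded correctly and only the console output is at fault.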
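
A variation on the same check, sketched under the assumption that the console can at least print ASCII: dump every character of the string as a hexadecimal escape, so the result does not depend on the console encoding at all. CodePointDump and toEscapes are hypothetical names.

```java
class CodePointDump {

    // Renders each UTF-16 code unit as a backslash-u escape with four
    // uppercase hex digits, so the result is plain ASCII.
    static String toEscapes(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char unit : text.toCharArray()) {
            escaped.append(String.format("\\u%04X", (int) unit));
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // Replace the literal with the string read from the file.
        System.out.println(toEscapes("Assal\u011F\u00E7"));
    }
}
```

If the dump of the string read from turkish.txt shows 011F at the sixth position, the decoding step is working and the remaining problem is in the output path.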


Tags: java utf-8