Reading UTF-8 characters using Scanner

Posted 2019-07-21 14:32

Question:

public boolean isValid(String username, String password)  {
        boolean valid = false;
        DataInputStream file = null;

        try{
            Scanner files = new Scanner(new BufferedReader(new FileReader("files/students.txt")));

            while(files.hasNext()){
                System.out.println(files.next());
            }

        }catch(Exception e){
            e.printStackTrace();
        }
        return valid;
    }

Why, when I read a file that was written as UTF-8 (by another Java program), does each entry print with weird symbols in front of the expected string?

I wrote the file using this:

    private static void  addAccount(String username,String password){
        File file = new File(file_name);
        try{
            DataOutputStream dos = new DataOutputStream(new FileOutputStream(file,true));
            dos.writeUTF((username+"::"+password+"\n"));
        }catch(Exception e){

        }
    } 

Answer 1:

Here is a simple way to do that:

File words = new File(path);
Scanner s = new Scanner(words,"utf-8");
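
A slightly fuller sketch of the same idea, in case it helps. The class and method names are made up for illustration; the path in main is the one from the question. It assumes the file holds plain UTF-8 text, so it will not remove the length bytes that DataOutputStream.writeUTF puts in front of each string.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class; the name is for illustration only.
class Utf8ScannerDemo {

    // Reads every whitespace-separated token from the file, decoding it as UTF-8.
    static List<String> readTokens(File file) throws FileNotFoundException {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the Scanner and the underlying file.
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Path taken from the question; adjust to your own layout.
        for (String token : readTokens(new File("files/students.txt"))) {
            System.out.println(token);
        }
    }
}
```

Passing the charset name explicitly means the result no longer depends on the platform default encoding.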


Answer 2:

From the FileReader Javadoc:

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

So perhaps something like new InputStreamReader(new FileInputStream(file), "UTF-8")
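
As a rough sketch of how that reader could be used, the snippet below wraps it in a BufferedReader and collects the lines. Utf8ReaderDemo and readLines are hypothetical names, and StandardCharsets.UTF_8 is used in place of the "UTF-8" string only to avoid the checked UnsupportedEncodingException.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is for illustration only.
class Utf8ReaderDemo {

    // Reads all lines of the file, decoding the bytes explicitly as UTF-8
    // instead of relying on the default encoding that FileReader assumes.
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; adjust to your own layout.
        for (String line : readLines("files/students.txt")) {
            System.out.println(line);
        }
    }
}
```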



Answer 3:

When using DataOutput.writeUTF/DataInput.readUTF, the first 2 bytes form an unsigned 16-bit big-endian integer denoting the size of the string.

First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.

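These length bytes are the likely cause of your problem: the Scanner decodes them as text, which produces the weird symbols in front of each string. Because addAccount appends one writeUTF record per call, every record carries its own 2-byte prefix, so you would need to skip those 2 bytes before each record (not just at the start of the file) and tell your Scanner to read UTF-8.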
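
If you do keep writeUTF, one option is to read the file back with its counterpart, DataInputStream.readUTF, which consumes the 2-byte length before each string for you. The following is a minimal sketch under that assumption; WriteUtfRoundTrip, appendRecord and readRecords are hypothetical names, and the temp file in main is only for demonstration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is for illustration only.
class WriteUtfRoundTrip {

    // Appends one record in the same shape as addAccount in the question,
    // but closes the stream so the bytes are reliably flushed to disk.
    static void appendRecord(String path, String username, String password) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path, true))) {
            out.writeUTF(username + "::" + password + "\n");
        }
    }

    // Reads records back one by one; readUTF handles each length prefix.
    static List<String> readRecords(String path) throws IOException {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            while (true) {
                try {
                    // trim() drops the trailing "\n" that was written with the record.
                    records.add(in.readUTF().trim());
                } catch (EOFException endOfFile) {
                    break; // no more records
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("accounts", ".bin"); // demo file only
        tmp.deleteOnExit();
        appendRecord(tmp.getPath(), "alice", "secret");
        appendRecord(tmp.getPath(), "bob", "hunter2");
        System.out.println(readRecords(tmp.getPath()));
    }
}
```

Catching EOFException is the usual way to detect the end of a stream of readUTF records, since the format has no record count or terminator.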

That being said, I do not see any reason to use DataOutput/DataInput here. You can simply use FileReader and FileWriter instead. Both use the platform's default encoding, so the file reads back correctly as long as the writer and the reader run with the same default.
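
A minimal sketch of the plain-text route follows. Note that it swaps FileWriter/FileReader for an OutputStreamWriter and a Scanner with an explicit UTF-8 charset, so the result does not depend on the default encoding. PlainTextAccounts and the path parameter are hypothetical; the "username::password" line format is the one from the question.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class; the name is for illustration only.
class PlainTextAccounts {

    // Appends "username::password" as one line of plain UTF-8 text (no length prefix).
    static void addAccount(String path, String username, String password) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(path, true), StandardCharsets.UTF_8))) {
            writer.write(username + "::" + password);
            writer.newLine();
        }
    }

    // Returns true if the file contains a line matching the given credentials.
    static boolean isValid(String path, String username, String password) throws IOException {
        String expected = username + "::" + password;
        try (Scanner scanner = new Scanner(new File(path), "UTF-8")) {
            while (scanner.hasNextLine()) {
                if (scanner.nextLine().equals(expected)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("students", ".txt"); // demo file only
        tmp.deleteOnExit();
        addAccount(tmp.getPath(), "alice", "secret");
        System.out.println(isValid(tmp.getPath(), "alice", "secret"));
        System.out.println(isValid(tmp.getPath(), "alice", "wrong"));
    }
}
```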



Tags: java utf-8 io