Isn't the size of a character in Java 2 bytes?

Posted 2019-01-03 12:26

I used RandomAccessFile to read a byte from a text file.

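// Requires: import java.io.IOException; import java.io.RandomAccessFile;
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);                   // reads at most one byte
    System.out.println(new String(cbuff));  // decoded with the platform default charset
}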

Why am I seeing one full character being read by this?

Tags: java string char
7 answers
狗以群分
#2 · 2019-01-03 12:37

It looks like your file contains only ASCII characters, each of which is encoded in a single byte. If the file contained a non-ASCII character, for example one that takes 2 bytes in UTF-8, you would get only its first byte, not the whole character.
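To make this concrete, here is a minimal sketch that mimics a one-byte read from UTF-8 text. The class and method names (FirstByteDemo, decodeFirstByte) are made up for illustration, and UTF-8 is assumed as the file encoding; the original question does not say which encoding the file uses.

```java
import java.nio.charset.StandardCharsets;

public class FirstByteDemo {
    // Hypothetical helper: encodes the text as UTF-8 (assumed encoding),
    // keeps only the first byte, and decodes that single byte again.
    public static String decodeFirstByte(String text) {
        byte[] all = text.getBytes(StandardCharsets.UTF_8);
        byte[] first = { all[0] };
        return new String(first, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A' is ASCII, so one byte is the whole character.
        System.out.println(decodeFirstByte("A")); // prints A

        // U+00E9 (e with acute accent) needs two bytes in UTF-8. One byte alone
        // is an incomplete sequence, which the decoder replaces with U+FFFD.
        String broken = decodeFirstByte("\u00E9");
        System.out.println(Integer.toHexString(broken.charAt(0))); // prints fffd
    }
}
```

With an ASCII-only file the one-byte read happens to yield a full character, which matches what the question observed.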

劳资没心,怎么记你
#3 · 2019-01-03 12:39

In an ASCII text file, each character is just one byte.

ら.Afraid
#4 · 2019-01-03 12:44

A Java char is 2 bytes because Java follows UTF-16. In that representation a character (code point) occupies a minimum of 2 bytes (one char) and a maximum of 4 bytes (a surrogate pair of two chars). A character never occupies 1 byte or 3 bytes.
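A small sketch of that 2-or-4-byte rule, using the standard Character.charCount and Character.BYTES. The class and method names (Utf16Size, utf16Bytes) are illustrative only, and the result describes the UTF-16 form exposed through char and String, not how a particular JVM lays out memory.

```java
public class Utf16Size {
    // Bytes a code point occupies in UTF-16:
    // one char (2 bytes) in the Basic Multilingual Plane, two chars (4 bytes) above it.
    public static int utf16Bytes(int codePoint) {
        return Character.charCount(codePoint) * Character.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(utf16Bytes('A'));      // prints 2
        System.out.println(utf16Bytes(0x20AC));   // prints 2 (euro sign)
        System.out.println(utf16Bytes(0x1F600));  // prints 4 (an emoji outside the BMP)
    }
}
```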

叛逆
#5 · 2019-01-03 12:45

There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any amount of space >= 2 bytes.

On many architectures there is a penalty for unaligned memory access, so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing (see https://en.wikipedia.org/wiki/False_sharing).

It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".

I also want to point out that Java chars are often misused. People don't realize they are writing code that won't properly handle code points that need more than 16 bits.
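One common form of this mistake is cutting a string at a char index, which can split a surrogate pair. The sketch below shows one possible way to truncate by code points instead; SafeTruncate and truncate are hypothetical names, while codePointCount and offsetByCodePoints are standard String methods.

```java
public class SafeTruncate {
    // Keeps at most maxCodePoints code points, never splitting a surrogate pair.
    public static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Translate a code point count into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', one emoji (two chars), 'b'

        System.out.println(text.length());                          // prints 4
        System.out.println(text.codePointCount(0, text.length()));  // prints 3

        // Naive char-based cut leaves a dangling high surrogate at the end.
        String naive = text.substring(0, 2);
        System.out.println(Character.isHighSurrogate(naive.charAt(1))); // prints true

        // Code point based cut keeps the emoji whole.
        System.out.println(truncate(text, 2).length()); // prints 3
    }
}
```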

看我几分像从前
#6 · 2019-01-03 12:52

Java stores all its chars internally as two bytes. However, when they are encoded into bytes (for example, when a string is written to a file), the number of bytes depends on the encoding.

In a variable-width encoding such as UTF-8, some characters (ASCII) take a single byte, but many others take several bytes.
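As a rough illustration of how the byte count depends on the encoding, the sketch below encodes a few strings with different standard charsets. EncodedLength and encodedLength are illustrative names, not part of any answer above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Number of bytes the string occupies once encoded with the given charset.
    public static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("A", StandardCharsets.UTF_8));           // prints 1
        System.out.println(encodedLength("\u00E9", StandardCharsets.UTF_8));      // prints 2
        System.out.println(encodedLength("\u20AC", StandardCharsets.UTF_8));      // prints 3
        System.out.println(encodedLength("\u00E9", StandardCharsets.ISO_8859_1)); // prints 1
        // UTF_16BE is used here because plain UTF_16 would prepend a byte order mark.
        System.out.println(encodedLength("A", StandardCharsets.UTF_16BE));        // prints 2
    }
}
```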

Java supports Unicode, and according to the Java Character docs, the maximum value of a char is '\uFFFF' (hex FFFF, decimal 65535), or 11111111 11111111 in binary (two bytes).

聊天终结者
#7 · 2019-01-03 12:59

The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them into characters.

It uses the platform default charset to decode the bytes. If you know that your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to specify the correct charset for the conversion from bytes to characters.
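A short sketch of why the charset matters when decoding. It uses the String(byte[] bytes, Charset charset) overload, a sibling of the charsetName constructor that avoids the checked UnsupportedEncodingException. The class and method names (ExplicitCharset, decode) and the sample bytes are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Decodes the bytes with an explicitly chosen charset instead of the platform default.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        // Decoded as UTF-8, the two bytes form one character.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());      // prints 1

        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length()); // prints 2
    }
}
```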
