How does the JVM determine the (default?) characte

Posted 2020-04-30 13:18

Question:

Java has a default character encoding, which it uses in contexts where a character encoding is not explicitly supplied. The documentation for how it chooses that encoding is vague:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

That documentation has to be vague because the method the JVM uses is system-specific.

Using the default character encoding is often a bad idea; it is better to specify an encoding explicitly, or at least to use the same encoding consistently for a given kind of I/O. But one seemingly unavoidable use of the default character encoding is decoding the command-line arguments. On a POSIX system such as Linux, the native (C/C++) code of the JVM receives the command-line arguments as a null-terminated array of C/C++ char pointers, which ought to be thought of as byte pointers, since they encode code points in some (unspecified) manner. The JVM has to interpret those sequences of C/C++ chars (bytes) and convert them into sequences of Java chars to pass to the main() of the Java program. I assume the JVM uses the default character encoding for this.
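To illustrate why the choice of encoding matters for argument bytes, here is a minimal sketch that decodes the same byte sequence with two different charsets. The class name ArgDecodingDemo and the helper decode() are hypothetical, and the sketch does not claim to show what the JVM does internally; it only shows that identical bytes can yield different Java strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ArgDecodingDemo {
    // Hypothetical helper: decode raw bytes (such as a native program
    // would see in argv) using the given charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // The two bytes UTF-8 uses to encode U+00E9 (e with acute accent).
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        // Decoded as UTF-8, the two bytes form a single Java char.
        System.out.println(decode(raw, StandardCharsets.UTF_8).length());      // prints 1

        // Decoded as ISO-8859-1, each byte becomes its own char.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length()); // prints 2
    }
}
```

The lengths are printed rather than the strings themselves so that the output does not depend on the encoding of the terminal.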

So I need to know precisely how the JVM determines the default encoding for a particular system (a modern GNU/Linux operating system), so I can provide user documentation about how my program behaves, and so users of my program can predict how it will behave.

I guess the JVM examines some environment variables, but which ones?

Answer 1:

You can, of course, look at the source code of java.nio.charset.Charset.defaultCharset(). When I do that on my system (64-bit Windows 7, with Oracle JDK 8 update 25) I see this:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

In other words, it looks at the system property file.encoding and if it cannot find a matching Charset instance, it uses UTF-8.
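To see what a particular system ends up with, a small diagnostic program can print the file.encoding property and the resulting default charset side by side. This is only a sketch: the class name ShowDefaultCharset and the helper env() are hypothetical, and printing LC_ALL, LC_CTYPE and LANG rests on the assumption that the JVM derives file.encoding from the POSIX locale on GNU/Linux, which the source above does not confirm.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Hypothetical helper: value of an environment variable,
    // or "(unset)" if it is not defined.
    static String env(String name) {
        String value = System.getenv(name);
        return value == null ? "(unset)" : value;
    }

    public static void main(String[] args) {
        // The system property that defaultCharset() consults in the code above.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));

        // The charset the JVM actually settled on.
        System.out.println("defaultCharset() = " + Charset.defaultCharset());

        // Assumption: these standard POSIX locale variables influence the result.
        for (String name : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            System.out.println(name + " = " + env(name));
        }
    }
}
```

Running it several times with different values for those variables and comparing the output is one way to check the assumption on a given machine and JDK version.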