Java has a default character encoding, which it uses in contexts where a character encoding is not explicitly supplied. The documentation for how it chooses that encoding is vague:
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.
That documentation has to be vague, because the method the JVM uses to choose the encoding is system-specific.
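Whatever the selection method is, the result can at least be observed at run time. A minimal sketch (the native.encoding property exists only on JDK 17 and later, where it records the encoding probed from the host environment at startup):

```java
import java.nio.charset.Charset;

public class DefaultCharsetProbe {
    public static void main(String[] args) {
        // The charset the JVM selected at startup, used wherever no
        // explicit encoding is supplied
        System.out.println("defaultCharset:  " + Charset.defaultCharset());
        // JDK 17+ only; null on older JDKs
        System.out.println("native.encoding: " + System.getProperty("native.encoding"));
    }
}
```

Running this under different locale settings (for example with LC_ALL=C versus LC_ALL=en_US.UTF-8 in the environment) would show whether a given JVM consults the locale at all.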
Using the default character encoding is often a bad idea; it is better to use an explicitly indicated encoding, or to always use the same encoding for a given piece of I/O. But one unavoidable use of the default character encoding would seem to be the decoding of command-line arguments. On a POSIX system such as Linux, the native (C/C++) code of the JVM receives the command-line arguments as a null-terminated list of C/C++ char pointers. Those are really byte pointers: the bytes encode code points in some (unclear) manner. The JVM has to interpret those sequences of C/C++ chars (bytes) to convert them into sequences of Java chars, to be given to the main() of the Java program. I assume the JVM uses the default character encoding for this.
So I need to know precisely how the JVM determines the default encoding on a particular system (a modern GNU/Linux operating system), so that I can document how my program behaves and its users can predict that behaviour.
I guess the JVM examines some environment variables (LANG? LC_ALL? LC_CTYPE?), but which ones, and with what precedence?