A few days ago, I was asked about this program's output:
public class Main {
    public static void main(String[] args) {
        // \u0022 is the Unicode escape for double quote (")
        System.out.println("a\u0022.length() + \u0022b".length());
    }
}
My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16 if each \u0022 counts as a single character. Surprisingly, it printed 2. I know \u0022 is the Unicode escape for ", but I thought this " would be escaped and represent just one literal " character, with no special meaning. In reality, Java somehow parsed the statement as follows:
System.out.println("a".length() + "b".length());
I can't wrap my head around this weird behavior. Why don't Unicode escapes behave like normal escape sequences?
Update: Apparently, this is one of the brain teasers from the book Java Puzzlers: Traps, Pitfalls, and Corner Cases by Joshua Bloch and Neal Gafter. More specifically, the question relates to Puzzle 14: Escape Rout.
Basically, they're processed at a different point in reading the input: in lexing rather than parsing, if I've got my terminology right. Unicode escapes are translated as the first step of lexical translation, before the source is split into tokens. They're not escape sequences in character literals or string literals, they're escape sequences for the whole source file. Any character in the source can be replaced with its Unicode escape sequence. So you can write programs entirely in ASCII which actually have variable, method and class names that are non-ASCII...
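To make the "whole source file" point concrete, here is a minimal sketch in which a keyword, part of an identifier, and a semicolon are all written as Unicode escapes. The class and method names (UnicodeEscapeDemo, twice) are made up for illustration.

```java
public class UnicodeEscapeDemo {
    // The return type below is the keyword int, spelled entirely with escapes.
    static \u0069\u006e\u0074 twice(int total) {
        // The name in the return statement is the parameter total,
        // with its letter a written as an escape.
        return tot\u0061l * 2;
    }

    public static void main(String[] args) {
        // The statement below ends with an escaped semicolon.
        System.out.println(twice(21))\u003b
    }
}
```

This should compile and print 42, because by the time the compiler looks for tokens, the escapes have already been replaced by the characters they stand for.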
Fundamentally I believe this was a design mistake in Java, as it can cause some very weird effects (e.g. if you have the escape sequence for a line break within a // comment...) but it is what it is... This is detailed in section 3.3 of the JLS.
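The line-break effect mentioned above can be reproduced with a small sketch like the following. The class and method names are hypothetical; the only thing that matters is the \u000a (line feed) escape sitting inside a // comment.

```java
public class CommentEscapeDemo {
    static int value() {
        int x = 1;
        // This comment is cut short by an escaped line feed: \u000a x = 2;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(value());
    }
}
```

Because the escape is translated to a real line terminator before comments are recognized, the comment ends early and x = 2; is compiled as an ordinary statement, so value() returns 2 rather than 1.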
It is just funny that the following works (taken from the reference)
but the following produces a compile error
On the second one, the compiler should reduce the \ and the " and put them together as \", but I tried it and it doesn't compile (the " still closes the string).
Before the compiler actually translates the source to bytecode, the lexical translation phase will turn the statement:
System.out.println("a\u0022.length() + \u0022b".length());
into:
System.out.println("a".length() + "b".length());
Hence the result is 2.
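For comparison, here is a small sketch that puts the puzzle expression next to a version using the ordinary \" string escape. The class and method names are invented for this example.

```java
public class EscapeRoutDemo {
    static int withUnicodeEscapes() {
        // The escaped quotes are translated before tokenization, so this
        // line contains two one-character string literals whose lengths are added.
        return "a\u0022.length() + \u0022b".length();
    }

    static int withBackslashEscapes() {
        // \" is interpreted inside the string literal, so this is a single
        // 16-character string.
        return "a\".length() + \"b".length();
    }

    public static void main(String[] args) {
        System.out.println(withUnicodeEscapes());   // 2
        System.out.println(withBackslashEscapes()); // 16
    }
}
```

The second method gives the result the question originally expected: the backslash escape is resolved only once the compiler is already inside the string literal, so it cannot terminate it.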
Also see this section about lexical translation from the Language Specification: