Unicode escape behavior in Java programs

Posted 2020-07-14 10:12

A few days ago, I was asked about this program's output:

public static void main(String[] args) {
    // \u0022 is the Unicode escape for double quote (")
    System.out.println("a\u0022.length() + \u0022b".length());
}

My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16, but surprisingly it printed 2. I know \u0022 is the Unicode escape for ", but I thought this " would be treated as escaped: a single " character inside the literal, with no special meaning. In reality, Java parsed the statement as follows:

System.out.println("a".length() + "b".length());

I can't wrap my head around this weird behavior. Why don't Unicode escapes behave like normal escape sequences?

Update: Apparently, this is one of the brain teasers in the book Java Puzzlers: Traps, Pitfalls, and Corner Cases by Joshua Bloch and Neal Gafter. More specifically, the question is Puzzle 14: Escape Rout.

Tags: java
3 answers
叛逆
#2 · 2020-07-14 10:45

Why don't Unicode escapes behave like normal escape sequences?

Basically, they're processed at a different point in reading the input: they're translated in the very first step of lexical translation, before the source is split into tokens, if I've got my terminology right. They're not escape sequences within character literals or string literals; they're escape sequences for the whole source file. Any character that's not part of a Unicode escape sequence can be replaced with its Unicode escape sequence. So you can write programs entirely in ASCII which actually have non-ASCII variable, method and class names...
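To make the "whole source file" point concrete, here is a small sketch (the class and method names are made up for illustration). It spells a keyword and a non-ASCII identifier using nothing but ASCII, and it should compile like any other class:

```java
// Every escape in this file is translated before the compiler looks for tokens.
class AsciiOnlySource {

    // \u0069\u006e\u0074 is the keyword "int" written with escapes.
    static \u0069\u006e\u0074 answer() {
        // The identifier below is "cafe" with an acute accent on the e,
        // spelled without leaving ASCII.
        int caf\u00e9 = 42;
        return caf\u00e9;
    }

    public static void main(String[] args) {
        System.out.println(answer()); // prints 42
    }
}
```

The escapes are not limited to literals: the return type and the variable name here are only recognisable after the translation step has run.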

Fundamentally I believe this was a design mistake in Java, as it can cause some very weird effects (e.g. if you have the escape sequence for a line break within a // comment...) but it is what it is...
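As a sketch of the comment case (again, the names are hypothetical), the escape for a line feed inside a // comment ends the comment, so the text after it is compiled as code:

```java
class CommentEscape {

    static int value() {
        int x = 1;
        // This comment contains a line-feed escape, so it ends early: \u000a x = 2;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(value()); // prints 2, not 1
    }
}
```

Anyone reading the source sees an assignment hidden in a comment; the compiler sees a comment followed by a separate line containing x = 2;.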

This is detailed in section 3.3 of the JLS:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.

...

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
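Two details from the quoted text can be checked directly. The sketch below (class and method names are mine, purely for illustration) shows that extra u's do not change the character, and that a supplementary character such as U+1F600 needs one escape per surrogate:

```java
class MultipleU {

    // Any number of u's after the backslash denotes the same code unit.
    static boolean sameLetter() {
        return '\u0041' == '\uu0041' && '\uuu0041' == 'A';
    }

    // U+1F600 lies outside the BMP, so it is written as a surrogate pair.
    static int codeUnits() {
        return "\ud83d\ude00".length();
    }

    static int codePoints() {
        String s = "\ud83d\ude00";
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(sameLetter()); // prints true
        System.out.println(codeUnits());  // prints 2
        System.out.println(codePoints()); // prints 1
    }
}
```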

够拽才男人
#3 · 2020-07-14 10:50

Funnily enough, the following works (taken from the reference):

System.out.println("a\".length() + \"b".length());

but the following produces a compile error:

System.out.println("a\\\u0022.length() + \\\u0022b".length());

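In the second one, I expected the compiler to reduce \\ to \ and \u0022 to ", and put them together as \". I tried it and it doesn't compile: the Unicode escape is translated first, which leaves \\" in the source, so the \\ is read as an escaped backslash and the " still closes the string.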
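For comparison, here is a sketch of two spellings that do compile (class and method names are hypothetical). As far as I understand JLS §3.3, a backslash only starts a Unicode escape when it is preceded by an even number of contiguous backslashes, which is why doubling the backslash protects the text after it:

```java
class BackslashParity {

    // Ordinary string escapes: both quotes stay inside the literal.
    static int escapedQuotes() {
        return "a\".length() + \"b".length();
    }

    // One backslash precedes the escape-looking text, so nothing is translated
    // early; the literal holds a real backslash followed by u0022.
    static int doubledBackslash() {
        return "a\\u0022.length() + \\u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(escapedQuotes());    // prints 16
        System.out.println(doubledBackslash()); // prints 26
    }
}
```

With three backslashes, as in the failing statement, the last backslash is preceded by two others, so the escape is translated after all and the quote ends the literal.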

Animai°情兽
#4 · 2020-07-14 10:51

Before the compiler actually translates the source to bytecode, the lexical translation phase will turn the statement:

System.out.println("a\u0022.length() + \u0022b".length());

into:

System.out.println("a".length() + "b".length());

Hence the result is 2.

Also see this section about lexical translation from the Language Specification:

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

  1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
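That first step can be imitated at runtime. The helper below is only a rough sketch under my own assumptions (translate is a made-up name, not anything from javac, and it throws on malformed escapes where a real compiler would report an error), but it shows the statement from the question turning into the two-literal form:

```java
class UnicodeEscapeSketch {

    // Hypothetical helper, not the compiler's code: mimics translation step 1.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // any number of u's is allowed
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int codeUnit = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(raw.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                codeUnit = codeUnit * 16 + digit;
            }
            out.append((char) codeUnit);
            backslashes = 0; // a translated character does not count as a raw backslash
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes intact until runtime.
        String raw = "System.out.println(\"a\\u0022.length() + \\u0022b\".length());";
        System.out.println(translate(raw));
    }
}
```

Running main should print the statement with both escapes replaced by plain double quotes, which is the form the tokenizer actually sees.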