In ECMA-262, 3rd edition[PDF], under section 7.6 ("Identifiers," page 26), we see the following note:
The dollar sign is intended for use only in mechanically generated code.
That seems reasonable. Many languages commonly used for generating or embedding JavaScript hold a special meaning for $, and using it in JavaScript identifiers within those languages leads to unexpected behavior.
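To make the pitfall concrete, here is a minimal sketch. The `interpolate` function is hypothetical: it stands in for a host language or template engine that expands `$name` variables before the JavaScript is delivered, and it is not any real tool's API.

```javascript
// Hypothetical stand-in for a host language that expands $name variables.
// Unknown variables expand to the empty string, as in many shells and templates.
function interpolate(template, vars) {
  return template.replace(/\$([A-Za-z_]\w*)/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(vars, name)
      ? String(vars[name])
      : "";
  });
}

// JavaScript source that uses $ in an identifier, embedded in such a template.
const embedded = "var $total = 1; alert($total);";
const generated = interpolate(embedded, { user: "alice" });

console.log(JSON.stringify(generated)); // "var  = 1; alert();"
```

The host language silently swallows the identifier, and what reaches the JavaScript engine is no longer valid code.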
The "mechanically generated clause" appeared in edition 2. In edition 1, it was not present. As of edition 5, it disappears again without explanation, and it remains absent from the working draft of the 6th edition.
If I had to guess, I'd assume it was originally omitted because the potential pitfalls hadn't been considered, and was then added in the next edition when it became clear that it was causing problems. I can't think of a good reason for removing it again in edition 5, though.
Is there any explanation for the inclusion and subsequent removal of the "mechanically generated clause" from the specification (a "paper trail" from mailing lists, newsgroups, or elsewhere)? I can't find this documented anywhere.
As a side question, can anyone explain the rationale behind including zero-width characters in the edition 6 draft? This seems like it will cause even more trouble, given that you can't see those characters at all, and I can't think of any reason you'd want those characters in an identifier.
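A minimal sketch of the concern, assuming an ES5-or-later engine: two identifiers that render identically in most editors but are distinct to the parser. The variable names are illustrative, and `new Function` is used only so that the invisible character can be spelled out as an escape.

```javascript
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER

const plain = "foobar";
const hidden = "foo" + ZWNJ + "bar"; // usually renders the same as "foobar"

// Build source text that declares both names as separate variables.
const source =
  "var " + plain + " = 1;\n" +
  "var " + hidden + " = 2;\n" +
  "return [" + plain + ", " + hidden + "];";

const result = new Function(source)();

console.log(result[0], result[1]);        // 1 2
console.log(plain === hidden);            // false
console.log(plain.length, hidden.length); // 6 7
```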
Update: The initial inclusion of the "mechanically generated code" note and the inclusion of zero-width characters are explained in codewaggle's answer below. The only thing remaining to be answered is the primary focus of this question, the removal of the "mechanically generated code" note.
Here's a start: Subject: SC22 N2745 - Disposition of Comments Report on DIS 16262 -ECMAScript
It appears that "should only be used for mechanically-generated code" was added because that is what the Java specification said.
D6) 7.5: DOLLAR SIGN should not be in the identifier list, according to recommendations in TR 10176. 7.5 should refer to the "i18n" specification of ISO/IEC 14652 for definitions of letters and digits.
>>>>>>
Action: Partial acceptance --- ECMAScript follows Java precedent. A comment will add that $ should only be used for mechanically-generated code. <<<<<
If you want to slog through the minutes of past meetings, you can look here:
ecmascript wiki: Notes and Minutes from past meetings
About later changes:
All of this is from the mailing list "es5-discuss -- Discussion of ECMAScript 3.x".
ZWNJ and ZWJ in identifiers (was: Comments on April ES5 final draft standard tc39-2009-025)
John Cowan wrote:
It turns out that Unicode 5.1 has done the heavy lifting: the bad news
is that the lifting is indeed heavy. You want to allow Cf characters
if and only if they actually make a semantic distinction in
contemporary use. That turns out, says Unicode 5.1, to allow only
U+200C and U+200D and then only in certain contexts: the rules involve
knowing the Script and Joining_Type properties of nearby identifier
characters. Details at http://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters.
David-Sarah Hopwood replied:
What is the down-side of simply adding U+200C and U+200D to
IdentifierPart without any additional context-sensitive rules?
I think that it is the combined responsibility of input methods and of
programmers to ensure that <ZWNJ> and <ZWJ> characters are used
as intended in identifiers; all that a programming language syntax needs to do is to allow them.
Note that the goal of "excluding as many cases as possible where no
visible distinction results" (supposedly for security reasons) is not
really applicable, since ECMAScript does not enforce even NFC
normalization. To not enforce NFC but to add considerable complexity
to the grammar, as UTR #31 suggests, in order to prevent some
potential (but relatively harmless, AFAICS) misuses of <ZWNJ> and <ZWJ>, seems like an inconsistent set of design choices to me.
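The point about NFC can be illustrated with a short sketch, assuming an ES2015-or-later engine (for `String.prototype.normalize`). The composed and decomposed spellings of "é" are canonically equivalent, yet the parser treats them as two different identifiers because no normalization is applied to source text.

```javascript
const composed = "\u00E9";    // é as a single code point
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

// Declare one variable per spelling and read both back.
const source =
  "var " + composed + " = 'composed';\n" +
  "var " + decomposed + " = 'decomposed';\n" +
  "return [" + composed + ", " + decomposed + "];";

const values = new Function(source)();

console.log(values[0], values[1]);    // composed decomposed
console.log(composed === decomposed); // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
```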
This one pulls a bunch of discussion together: Last call for consensus on format-control char. issues
There are 15 replies to this, you'll probably want to read through those:
https://mail.mozilla.org/pipermail/es5-discuss/2009-June/thread.html#2832
Allen Wirfs-Brock wrote:
Waldemar's notes from the May F2F don't record any decision on the
issue of <ZWNJ> and <ZWJ> in identifiers. However, my personal notes
say that I need to "keep in identifiers and fix grammar" which is also
my recollection of what we decided at the meeting.
The simplest implementation of that decisions is to simply add <ZWNJ>
and <ZWJ> as alternatives for IdentifierPart. In addition, the text in
section 7.1 that says that format control characters can occur in
identifier presumably needs to be narrowed to say only <ZWNJ> and <ZWJ>.
At about the same time as the F2F David-Sarah made a more
comprehensive proposal (duplicated below) that in addition to
addressing <ZWNJ> and <ZWJ> also significantly refines the rules for
<BOM> including excluding them from strings literals and regular
expressions and making it a syntax error for a <BOM> to appear within
an identifier.
I'm not a Unicode expert, but my sense is that David-Sarah's proposal
is sound and probably consistent with the original goals of cleaning
up class Cf in the specification. However, his rules for <BOM> also
seem like they could significantly complicate the lexical analysis
phase of implementations.
My sense from the F2F is that the consensus was more in the direction
of my simple solution above (<ZWNJ> and <ZWJ> in identifiers, <BOM> is
whitespace) rather than David-Sarah's more comprehensive treatment of
<BOM>.
I need to have a final decision on this so I can update the draft
accordingly. Based upon my recollection of the F2F I'm going to go
with the "simple solution" unless there is apparent consensus
otherwise.
Final thoughts?
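The "simple solution" described above can be observed in engines that implement ES5 or later. This is a small sketch rather than a conformance test: <BOM> behaves as whitespace, while <ZWNJ> is not whitespace but is accepted inside an identifier.

```javascript
const BOM = "\uFEFF";  // BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE
const ZWNJ = "\u200C"; // ZERO WIDTH NON-JOINER

// <BOM> is matched by \s and stripped by trim(); <ZWNJ> is not whitespace.
console.log(/^\s$/.test(BOM));                     // true
console.log(/^\s$/.test(ZWNJ));                    // false
console.log((BOM + "abc" + BOM).trim() === "abc"); // true

// <BOM> separates tokens just like a space would.
const viaBom = new Function("var" + BOM + "x = 1; return x;")();

// <ZWNJ> is part of the identifier instead of separating tokens.
const viaZwnj = new Function(
  "var x" + ZWNJ + "y = 2; return x" + ZWNJ + "y;"
)();

console.log(viaBom, viaZwnj); // 1 2
```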
The message he replied to, broken into chunks based on the message quoting:
-----Original Message-----
From: es5-discuss-bounces at mozilla.org [mailto:es5-discuss-bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
Sent: Thursday, May 28, 2009 5:44 PM
To: es5-discuss at mozilla.org
Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>
John Cowan wrote:
David-Sarah Hopwood scripsit:
The omission of format-control characters from <IdentifierName> appears
to be just an oversight.
-1
Break
Indeed, I had forgotten that we had already discussed this and come to
a different conclusion:
https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002432.html
https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html.
Break
Allowing all of them causes the same kinds of problems as allowing
BOM. Most of them have little visible effect on the surrounding text
(especially Latin-script text) even in fully conformant Unicode renderers,
never mind renderers that muffle them. The result is that "foobar" and
"foo<Cf>bar" look the same but aren't.
Per Unicode 5.1, the only ones that actually affect the natural-language
meaning of identifiers are U+200C ZWNJ and U+200D ZWJ. These are the only
ones which should even be considered in ES5 identifiers. UAX #31 (which
is included by reference in Unicode 5.1) specifies narrower conditions
in which ZWNJ and ZWJ are essential; sticking to the conditions is
non-trivial, but minimizes the chance of spoofing.
Given the risks, I'm uncertain whether ZWNJ and ZWJ should be allowed
or not.
Break
Forget trying to minimize identifier spoofing as a security risk. That's
not possible, if Unicode identifiers are to be allowed at all. It is an
inherent characteristic of Unicode that many distinct (even when normalized)
strings will look the same. It is not at all clear that this is a genuine
security risk for general programming -- as opposed to situations that
require adversarial code review, which full ECMAScript is a long way
from being able to support.
What is useful to attempt to minimize is the chance of accidentally
typing identifiers that are distinct but look the same, or of seeing an
identifier and being unable to reliably reproduce it. This is a usability
issue, not a security issue.
For usability, it may indeed be a good approach to allow <ZWNJ> and <ZWJ>
but disallow other format-control characters. I am not sufficiently
familiar with the scripts that require these characters to be sure of
that, but it seems reasonable based on their descriptions in the Unicode
standard.
However, the complicated script-dependent rules described in UAX #31 for
restricting the contexts in which <ZWNJ> and <ZWJ> can occur, seem quite
over-the-top given the impossibility of preventing spoofing. Again, see
https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html.
Combining the proposal from that post with the changes for <NEL>, <ZWSP>
and <BOM> (since both affect section 7.1), we end up with this.
====
Changes to section 7.2:
- revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and
to the table.
Changes to section 7.8.4:
DoubleStringCharacter ::
SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM>
\ EscapeSequence
LineContinuation
SingleStringCharacter ::
SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM>
\ EscapeSequence
LineContinuation
NonEscapeCharacter ::
SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>
The CV of DoubleStringCharacter :: SourceCharacter but not
double-quote " or backslash \ or LineTerminator or <BOM>
is the SourceCharacter character itself.
The CV of SingleStringCharacter :: SourceCharacter but not
single-quote ' or backslash \ or LineTerminator or <BOM>
is the SourceCharacter character itself.
The CV of NonEscapeCharacter :: SourceCharacter but not
EscapeCharacter or LineTerminator or <BOM>
is the SourceCharacter character itself.
Replace section 7.1:
7.1 Unicode Format-Control Characters
The Unicode format-control characters (i.e., the characters in
General Category "Cf" in the Unicode Character Database such as
LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to
control the formatting of a range of text in the absence of
higher-level protocols for this, such as mark-up languages.
<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files.
In ECMAScript source, <BOM> characters are ignored if they appear immediately before or after a token, or within a span of consecutive WhiteSpace characters (7.2). The lexical grammar does not explicitly include such ignored <BOM> characters. It is a syntax error for a <BOM> character to appear within a token (that is, if removing the <BOM> would result in the preceding and following characters being part of the same token).
Note that comments are not tokens, and so the above rule allows <BOM> characters to appear within comments. It does not allow them to appear within string literals or regular expression literals (the escape sequence \uFEFF should be used instead).
It is useful to allow other format-control characters in source text to facilitate editing and display. Format-control characters other than <BOM> may be used within comments, string literals, and regular expression literals. Two specific format-control characters, <ZWNJ> and <ZWJ>, may also be used in an identifier after the first character.
Code Unit Value    Name                                                          Formal name
\u200C             Zero width non-joiner                                         <ZWNJ>
\u200D             Zero width joiner                                             <ZWJ>
\uFEFF             Byte order mark (also called zero-width non-breaking space)   <BOM>
Changes to section 7.6:
[...] This standard specifies specific character additions: The
dollar sign ($) and the underscore (_) are permitted anywhere in
an identifier. <ZWNJ> and <ZWJ> are permitted after the first
character.
Changes to section 7.8.5:
RegularExpressionNonTerminator ::
SourceCharacter but not LineTerminator or <BOM>
Changes to Annex A:
- update all productions changed above.
Changes to Annex E:
- add to the entry for section 7.1:
<BOM> characters are ignored between tokens and in comments,
but are not allowed within tokens (including string and
regular expression literals). <ZWNJ> and <ZWJ> are significant
within identifiers rather than being stripped.
delete the entries for sections 7.2 and 15.10.2.12.
(Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the
WhiteSpace production also reverts this for the \s character
class, without any explicit change to section 15.10.2.12.)
--
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com
es5-discuss mailing list
es5-discuss at mozilla.org
https://mail.mozilla.org/listinfo/es5-discuss
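As a rough illustration of the identifier rule in the proposed section 7.6 text ($ and _ anywhere, <ZWNJ> and <ZWJ> only after the first character), here is a sketch of a checker. It rests on several assumptions: the name `isIdentifierLike` is made up for this example, it uses the `ID_Start` and `ID_Continue` Unicode properties from later editions rather than the general categories ES5 lists, it needs an engine with Unicode property escapes (ES2018+), and it ignores reserved words and `\u` escape sequences.

```javascript
// First character: ID_Start, $ or _.
// Later characters: ID_Continue, $, <ZWNJ> (U+200C) or <ZWJ> (U+200D).
const identifierLike = /^[\p{ID_Start}$_][\p{ID_Continue}$\u200C\u200D]*$/u;

// Hypothetical helper; approximates the rule, not a full IdentifierName parser.
function isIdentifierLike(name) {
  return identifierLike.test(name);
}

console.log(isIdentifierLike("$total"));      // true
console.log(isIdentifierLike("foo\u200Cbar")); // true  (ZWNJ after first char)
console.log(isIdentifierLike("\u200Dfoo"));    // false (ZWJ as first char)
console.log(isIdentifierLike("foo\uFEFFbar")); // false (BOM is not allowed)
console.log(isIdentifierLike("1abc"));         // false (digit as first char)
```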
I'm not going to try to pull all this together into a succinct answer. Maybe someone else will, and you can accept that as the answer; treat this as a starting point.
One last link:
The August 2009 archive has the initial draft and release candidate 1 discussions for ES5.