Javascript has a tricky grammar to parse. Forward-slashes can mean a number of different things: division operator, regular expression literal, comment introducer, or line-comment introducer. The last two are easy to distinguish: if the slash is followed by a star, it starts a multiline comment. If the slash is followed by another slash, it is a line-comment.
But the rules for disambiguating division and regex literal are escaping me. I can't find it in the ECMAScript standard. There the lexical grammar is explicitly divided into two parts, InputElementDiv and InputElementRegExp, depending on what a slash will mean. But there's nothing explaining when to use which.
And of course the dreaded semicolon insertion rules complicate everything.
Does anyone have an example of clear code for lexing Javascript that has the answer?
It's actually fairly easy, but it requires making your lexer a little smarter than usual.
The division operator must follow an expression, and a regular expression literal can't follow an expression, so in all other cases you can safely assume you're looking at a regular expression literal.
You already have to identify Punctuators as multiple-character strings, if you're doing it right. So look at the previous token, and see if it's any of these:
. ( , { } [ ; , < > <= >= == != === !== + - * % ++ --
<< >> >>> & | ^ ! ~ && || ? : = += -= *= %= <<= >>= >>>=
&= |= ^= / /=
For most of these, you now know you're in a context where you can find a regular expression literal. Now, in the case of ++ --
, you'll need to do some extra work. If the ++
or --
is a pre-increment/decrement, then the /
following it starts a regular expression literal; if it is a post-increment/decrement, then the /
following it starts a DivPunctuator.
Fortunately, you can determine whether it is a "pre-" operator by checking its previous token. First, post-increment/decrement is a restricted production, so if ++
or --
is preceded by a linebreak, then you know it is "pre-". Otherwise, if the previous token is any of the things that can precede a regular expression literal (yay recursion!), then you know it is "pre-". In all other cases, it is "post-".
Of course, the )
punctuator doesn't always indicate the end of an expression - for example if (something) /regex/.exec(x)
. This is tricky because it does require some semantic understanding to disentangle.
Sadly, that's not quite all. There are some operators that are not Punctuators, and other notable keywords to boot. Regular expression literals can also follow these. They are:
new delete void typeof instanceof in do return case throw else
If the IdentifierName you just consumed is one of these, then you're looking at a regular expression literal; otherwise, it's a DivPunctuator.
The above is based on the ECMAScript 5.1 specification (as found here) and does not include any browser-specific extensions to the language. But if you need to support those, then this should provide easy guidelines for determining which sort of context you're in.
Of course, most of the above represent very silly cases for including a regular expression literal. For example, you can't actually pre-increment a regular expression, even though it is syntactically allowed. So most tools can get away with simplifying the regular expression context checking for real-world applications. JSLint's method of checking the preceding character for (,=:[!&|?{};
is probably sufficient. But if you take such a shortcut when developing what's supposed to be a tool for lexing JS, then you should make sure to note that.
I am currently developing a JavaScript/ECMAScript 5.1 parser with JavaCC. RegularExpressionLiteral and Automatic Semicolon Insertion are two things which make me crazy in ECMAScript grammar. This question and an answers were invaluable for the regex question. In this answer I'd like to put my own findings together.
TL;DR In JavaCC, use lexical states and switch them from the parser.
Very important is what Thom Blake wrote:
The division operator must follow an expression, and a regular
expression literal can't follow an expression, so in all other cases
you can safely assume you're looking at a regular expression literal.
So you actually need to understand if it was an expression or not before. This is trivial in the parser but very hard in the lexer.
As Thom pointed out, in many (but, unfortunately, not all) cases you can understand if it was an expression by "looking" at the last token. You have to consider punctuators as well as keywords.
Let's start with keywords. The following keywords cannot precede a DivPunctuator
(for example, you cannot have case /5
), so if you see a /
after these, you have a RegularExpressionLiteral
:
case
delete
do
else
in
instanceof
new
return
throw
typeof
void
Next, punctuators. The following punctuators cannot precede a DivPunctuator
(ex. in { /a...
the symbol /
can never start a division):
{ ( [
. ; , < > <=
>= == != === !==
+ - * %
<< >> >>> & | ^
! ~ && || ? :
= += -= *= %= <<=
>>= >>>= &= |= ^=
/=
So if you have one of these and see /...
after this, then this can never be a DivPunctuator
and therefore must be a RegularExpressionLiteral
.
Next, if you have:
/
And /...
after that it also must be a RegularExpressionLiteral
. If there were no space between these slashes (i.e. // ...
), this must have handled as a SingleLineComment
("maximal munch").
Next, the following punctuator may only end an expression:
]
So the following /
must start a DivPunctuator
.
Now we have the following remaining cases which are, unfortunately, ambiguous:
}
)
++
--
For }
and )
you have to know if they end an expression or not, for ++
and --
- they end an PostfixExpression
or start an UnaryExpression
.
And I have come to the conclusion that it is very hard (if not impossible) to find out in the lexer. To give you a sense of that, a couple of examples.
In this example:
{}/a/g
/a/g
is a RegularExpressionLiteral
, but in this one:
+{}/a/g
/a/g
is a division.
In case of )
you can have a division:
('a')/a/g
as well as a RegularExpressionLiteral
:
if ('a')/a/g
So, unfortunately, it looks like you can't solve it with the lexer alone. Or you'll have to bring in so much grammar into the lexer so it's no lexer anymore.
This is a problem.
Now, a possible solution, which is, in my case JavaCC-based.
I am not sure if you have similar features in other parser generators, but JavaCC has a lexical states feature which can be used to switch between "we expect a DivPunctuator
" and "we expect a RegularExpressionLiteral
" states. For instance, in this grammar the NOREGEXP
state means "we don't expect a RegularExpressionLiteral
here".
This solves part of the problem, but not the ambiguous )
, }
, ++
and --
.
For this, you'll need to be able to switch lexical states from the parser. This is possible, see the following question in JavaCC FAQ:
Can the parser force a switch to a new lexical state?
Yes, but it is very easy to create bugs by doing so.
A lookahead parser may have already gone too far in the token stream (i.e. already read /
as a DIV
or vice versa).
Fortunately there seems to be a way to make switching lexical states a bit safer:
Is there a way to make SwitchTo safer?
The idea is to make a "backup" token stream and push tokens read during lookahead back again.
I think that this should work for }
, )
, ++
, --
as they are normally found in LOOKAHEAD(1) situations, but I am not 100% sure of that. In the worst case the lexer may have already tried to parse /
-starting token as a RegularExpressionLiteral
and failed as it was not terminated by another /
.
In any case, I see no better way of doing that. The next good thing would be probably to drop the case altogether (like JSLint
and many others did), document and just not parse these types of expressions. {}/a/g
does not make much sense anyway.
JSLint appears to expect a regular expression if the preceding token is one of
(,=:[!&|?{};
Rhino always returns a DIV token from the lexer.
You can only know how to interpret the / by also implementing a syntax parser. Whichever lex path arrives at a valid parse determines how to interpret the character. Apparently, this is something they had considered fixing, but didn't.
More reading here:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
See section 7:
There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a leading division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.
NOTE There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 7.9); in examples such as the
following:
a = b
/hi/g.exec(c).map(d);
where the first non-whitespace, non-comment character after a LineTerminator is slash (/) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in
the same way as:
a = b / hi / g.exec(c).map(d);
I agree, it's confusing and there should be one top-level grammar expression rather than two.
edit:
But there's nothing explaining when to use which.
Maybe the simple answer is staring us in the face: try one and then try the other. Since they are not both permitted, at most one will yield an error-free match.