Javacc Unreachable Statement

2019-04-14 10:59发布

In my grammar there are production rules for expressions and fragments which originally contained indirect left recursion. This is the rules after I removed the recursion from them.

String expression() #Expression : {String number; Token t;}
{
    number = fragment()
    (
        (t = <Mult_Sign> number = fragment())
    )
    {return number;}
}

String fragment() #void : {String t;}
{
    t = identifier() {return t;}
    | t = number() {return t;}
    | (<PLUS> | <MINUS> ) fragment()
    | <LBR> expression() <RBR>
}

These production rules are used when trying to parse a condition in the grammar. However the ordering of the production rules either has it so only expression is accepted. Yet it should accept something like while (x <= 10). If I have the production rules in the opposite order, as originally stated in the grammar. When I try compile the java file using javac. I receive an error which tells me identifier() is an unreachable statement. This is the condition production rule:

void condition() #void : {Token t;}
{
    <NOT> expression()
    | expression (<EQUALS>|<NOTEQUALS>|<LT>|<GT>|<LTE>|<GTE>|<AND>|<OR>) expression()
    | identifier()
}

If anyone could help tell me why this problem is occurring it would be very helpful.

1条回答
地球回转人心会变
2楼-- · 2019-04-14 11:40

You have

void condition() #void : {Token t;}
{
/*a*/     <NOT> expression()
/*b*/     | expression (<EQUALS>|<NOTEQUALS>|<LT>|<GT>|<LTE>|<GTE>|<AND>|<OR>) expression()
/*c*/     | identifier()
}

If the parser is looking for a condition, it will try to make the choice between the three alternatives based on the next token of input. If that token is an identifier, there is a problem, since either alternative (b) or alternative (c) could work. Faced with a choice conflict, JavaCC prefers the first, so (b) will be chosen. And if the next token is not an identifier, then alternative (c) will not be chosen. So either way alternative (c) will not be reached.


That is your problem. What should be done about it? Here is the usual solution.

If you want to allow further operators in expressions, make more nonterminals representing more levels of precedence. For example

condition --> expression
expression --> disjunct (OR expression)?
disjunct --> conjunct (AND disjunct)?
conjunct --> comparand ((EQ|NEQ|LT|GT|LE|GE) comparand)?
comparand --> term ((PLUS|MINUS) term)*
term --> fragment ((TIMES | DIVIDE) fragment)*
fragment --> identifier | number | LBR expression RBR | (PLUS|MINUS|NOT) fragment

This grammar will accept everything your want and probably more. For example, if you have

statement --> WHILE condition DO statement

your parser will accept e.g. "WHILE a+b DO a:=b". In many languages this is taken care of by type checking; Java does it this way. In other languages it is dealt with by allowing all sort of things as conditions; LISP does this.


A note on the precedence of NOT

Most languages treat the precedence of NOT as very high, as in the second part of this answer. This has the nice effect of eliminating all choice warnings as the grammar is LL(1).

However if you want unary operators to have lower precedence there is really nothing stopping you, if you use JavaCC. E.g. you could change fragment to

fragment --> identifier | number | LBR expression RBR | (PLUS|MINUS) fragment | NOT conjunct

Now the grammar is not LL(1) (it's not even unambiguous). So JavaCC will give some choice conflict warnings. But it will actually parse e.g. "NOT a LT b" as "NOT (a LT b)"


What almost no language does is what I think you are trying to do, which is to restrict the syntax so that only expressions that look like conditions are allowed to be conditions. If this is truly what you want, then you can do it with JavaCC using syntactic lookahead. Here is how you do it.

Start with a grammar like this one. (This is essentially your idea with more attention paid to levels of precedence.)

condition --> disjunct (OR condition)?
disjunct --> conjunct (AND disjunct)?
conjunct --> expression (EQ|NEQ|LT|GT|LE|GE) expression
           | LBR condition RBR
           | NOT conjunct
           | identifier

expression --> term ((PLUS|MINUS) term)*
term --> fragment ((TIMES | DIVIDE) fragment)*
fragment --> identifier | number | LBR expression RBR | (PLUS|MINUS) fragment

This is an unambiguous grammar for conditions. However it has a choice conflict at conjunct when the next token is an identifier or an LBR. To resolve this choice conflict you look ahead for the comparison operator using syntactic lookahead thus

void conjunct() : { } {
    LOOKAHEAD( expression() (<EQ>|<NEQ>|<LT>|<GT>|<LE>|<GE>) )
    expression() (<EQ>|<NEQ>|<LT>|<GT>|<LE>|<GE>) expression()
|   LBR condition() RBR
|   NOT conjunct()
|   identifier() {

So why does (almost) no programming language do it this way? Most languages have variables of boolean type and so, like you, allow identifiers as conditions. So you still have to do type checking to rule out "WHILE i DO ..." where "i" is not of boolean type. Also, what should you use for the syntax of assignment? You need

statement --> identifier := (expression | condition) | ...

Even syntactic lookahead won't tell you which choice is right for "x := y". This is an ambiguous grammar.

If either choice is acceptable in the cases where both choices parse, then you use syntactic lookahead here too.

void statement() : {} {
    identifier <BECOMES> (LOOKAHEAD(condition()) condition()) | expression())
| ...
}

This will parse "y" in "x:=y" as a condition even if it is numeric. If you are aware of this and design the rest of the compiler so everything still works, no harm is done.

Another disadvantage of this approach is that parsing is now quadratic time in theory. I don't think this is a serious concern.

查看更多
登录 后发表回答