I'm working on a parser, using GNU bison, and I'm facing an interesting problem. My actual problem is a bit different and less interesting in general, so I will state it a bit differently, so that an answer would be more useful in general.
I need to distinguish expressions depending on their type, like arithmetic expressions from string expressions. Their top-level nonterminals have some common ancestors, like
statement
: TOK_PRINT expression TOK_SEMICOLON {...}
;
expression
: arithmeticexpression {...}
| stringexpression {...}
;
Now I need to be able to have variables in both kinds of expressions
arithmeticexpression
: TOK_IDENTIFIER {...}
;
stringexpression
: TOK_IDENTIFIER {...}
;
(only the variables of string-type should be allowed in the stringexpression case and only variables of int- or float-type should be allowed in the arithmeticexpression case)
But this obviously causes an R/R conflict, which is inherent in the language - it is impossible to resolve, because the language is ambiguous.
Of course I could water down the language, so that only general "Expression" objects are passed around on the parse-stack, but then I'd have to do a lot of manual type-checking in the actions, which I would like to avoid.
Also, in my real use-case, the paths through the grammar to the variable-rules are so different, that I would have to water the language down so much that I would lose a lot of grammar-rules (i.e. lose a lot of structural information) and would need to hand-write parsing-engines into some actions.
I have read about GLR-parsing, which sounds like it could solve my problem. I'm considering to use this feature: have the grammar as above and YYERROR
in the paths where the corresponding variable has the wrong type.
arithmeticexpression
: TOK_IDENTIFIER {
if(!dynamic_cast<IntVariable*>(
symbol_table[*$<stringvalue>1]))
YYERROR;
}
;
stringexpression
: TOK_IDENTIFIER {
if(!dynamic_cast<StringVariable*>(
symbol_table[*$<stringvalue>1]))
YYERROR;
}
;
but the Bison Manual says
- During deterministic GLR operation, the effect of YYERROR is the same
as its effect in a deterministic parser.
- The effect in a deferred action is similar, but the precise point of
the error is undefined; instead, the parser reverts to deterministic
operation, selecting an unspecified stack on which to continue with a
syntax error.
- In a semantic predicate (see Semantic Predicates) during
nondeterministic parsing, YYERROR silently prunes the parse that invoked
the test.
I'm not sure I understand this correctly - I understand it this way:
- doesn't apply here, because GLR-parsing isn't deterministic
- is how I have it in the code above, but shouldn't be done this way, because it is completely random which path is killed by the YYERROR (correct me if I'm wrong)
- wouldn't solve my problem, because semantic predicates (not semantic actions!) must be at the beginning of a rule (correct me if I'm wrong), at which point the yylval from the TOK_IDENTIFIER token isn't accessible (correct me if I'm wrong), so I can't look into the symbol table to find the type of the variable.
Does anyone have experience with such a situation? Is my understanding of the manual wrong? How would you approach this problem? To me, the problem seems to be so natural that I would assume people run into it often enough that bison would have a solution built-in...
Note: Since at least bison 3.0.1 and as of bison 3.0.5, there is a bug in the handling of semantic predicate actions which causes bison to output a #line
directive in the middle of a line, causing the compile to fail. A simple fix is described in this answer, and has been committed to the maint
branch of the bison repository. (Building bison from the repository is not for the faint-hearted. But it would suffice to download data/c.m4
from that repository if you didn't want to just edit the file yourself.)
Let's start with the concrete question about YYERROR
in a GLR parser.
In effect, you can use YYERROR
in a semantic predicate in order to disambiguate a parse by rejecting the production it is part of. You cannot do this in a semantic action because semantic actions are not executed until the parser has decided on a unique parse, at which point there is no longer an ambiguity.
Semantic actions can occur anywhere in a rule, and will be immediately executed when they are reduced, even if the reduction is ambiguous. Because they are executed out of sequence, they cannot refer to the value produced by any preceding semantic action (and semantic predicates themselves have no semantic values even though they occupy a stack slot). Also, because they are effectively executed speculatively as part of a production which might not be part of the final parse, they should not mutate the parser state.
The semantic predicate does have access to the lookahead token, if there is one, through the variable yychar
. If yychar
is neither YYEMPTY
nor YYEOF
, the associated semantic value and location information are in yylval
and yylloc
, respectively. (see Lookahead Tokens). Care must be taken before using this feature, though; if there is no ambiguity at that point in the grammar, it is likely that the bison parser will have executed the semantic predicate without reading the lookahead token.
So your understanding was close, but not quite correct:
For efficiency, the GLR parser recognizes when there is no ambiguity in the parse, and uses the normal straightforward shift-reduce algorithm at those points. In most grammars, ambiguities are rare and are resolved rapidly, so this optimisation means that the overhead associated with maintaining multiple alternatives can be avoided for the majority of the parse. In this mode (which won't apply if ambiguity is possible, of course), YYERROR
leads to standard error recovery, as in a shift-reduce parser, at the point of reduction of the action.
Since deferred actions are not run until ambiguity has been resolved, YYERROR
in a deferred action will effectively be executed as though it were in a deterministic state, as in the previous paragraph. This is true even if there is a custom merge procedure; if a custom merge is in progress, all the alternatives participating in the custom merge will be discarded. Afterwards, the normal error recovery algorithm will continue. I don't think that the error recovery point is "random", but predicting where it will occur is not easy.
As indicated above, semantic predicates can appear anywhere in a rule, and have access to the lookahead token if there is one. (The Bison manual is a bit misleading here: a semantic predicate cannot contain an invocation of YYERROR
because YYERROR
is a statement, and the contents of the semantic predicate block is an expression. If the value of the semantic predicate is false, then the parser performs the YYERROR
action.)
In effect, that means that you could use semantic predicates instead of (or as well as) actions, but basically in the way that you propose:
arithmeticexpression
: TOK_IDENTIFIER %?{
dynamic_cast<IntVariable*>(symbol_table[*$1])
}
stringexpression
: TOK_IDENTIFIER %?{
dynamic_cast<StringVariable*>(symbol_table[*$1])
}
Note: I removed the type declarations from $<stringvalue>1
because:
It's basically unreadable, at least to me,
More importantly, like an explicit cast, it is not typesafe.
You should declare the semantic type of your tokens:
%type <stringvalue> ID
which will allow bison to type-check.
Whether or not using semantic predicates for that purpose is a good idea is another question.
The example here works, but when I put it together with the solution for Bison: GLR-parsing of valid expression fails without error message I ran into the following problem:
The variable could be determined by more than one identifier. You could have an array-index or it could be a member of a different object. I modeled this situation by having another non-terminal like
lvalue
: TOK_IDENTIFIER
| lvalue '[' arithmeticexpression ']'
| lvalue '.' TOK_IDENTIFIER
But when I now have
arithmeticexpression : lvalue
stringexpression : lvalue
and (try to) access the object from the non-terminal "lvalue" like above, I get a segmentation fault. So it seems that the method here doesn't work for more complex situations.
What I did instead now is that I access the object in the semantic ACTION and set $$
to nullptr
if the variable has the wrong type.
arithmeticexpression
: lvalue
{
$$ = nullptr;
if(dynamic_cast<IntVariable*>($lvalue->getVariable()))
$$ = new ArithmeticExpressionIntVariable($lvalue);
}
Then I had to propagate the nullptr
(this is quite a lot of additional code) like this
arithmeticexpression
...
| arithmeticexpression[left] '+' arithmeticexpressionM[right]
{
$$ = nullptr;
if($left && $right)
$$ = new ArithmeticExpressionPlus($left, $right);
}
So now, at the
expression
: arithmeticexpression
| stringexpression
(note: I have "Expression* expr;" in the "%union{" declaration
and "%type <expression> expr" in the prologue)
I have an ambiguity: the same input text can be parsed in two different ways, but only one of them will have a value != nullptr
. At this point, I need a custom merge procedure which basically just selects the non-null value.
For this, I forward-declared it in the prologue of my bison file like this
static Expression* exprMerge (yy::semantic_type x0, yy::semantic_type x1);
and defined it in the epilogue like this
static Expression* exprMerge (YYSTYPE x0, YYSTYPE x1) \
{
/* otherwise we'd have a legitimate ambiguity */
assert(!x0.expr || !x1.expr);
return x0.expr ? x0.expr : x1.expr;
}
finally, I had to tell bison to use this procedure to resolve the ambiguity, using
expression
: arithmeticexpression %merge <exprMerge>
| stringexpression %merge <exprMerge>
Honestly, I think this is quite a lot of effort, which wouldn't be necessary if bison tested semantic predicates (at least the ones which are behind the rule) when it tries to merge the paths, not to rule out paths before they merge.
But at least this works and is far less effort than collecting the tokens and sorting them manually, which would have been the alternative.