I have a grammar like this:
grammar Testquote;
program : (Line ';')+ ;
Line: L_S_STRING ;
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\''; // Single quoted string literal
L_WS : L_BLANK+ -> skip ; // Whitespace
fragment L_BLANK : (' ' | '\t' | '\r' | '\n') ;
This grammar, and the L_S_STRING rule in particular, seems to work fine with plain inputs like:
'ab';
'cd';
However, it fails with this input:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
'cd';
Yet it works when I change the first line to either
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z''';
or
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\' '
;
I can sort of see why the parser might take this failing route. But is there some way I can tell it to choose differently?
According to the ANTLR4 docs, both lexer and parser rules are greedy, so they match as much input as they can. In your case:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';
                               ^^^
'cd';
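Your grammar is somewhat ambiguous: the characters I've highlighted can be read either as \' followed by ' (an escaped quote, then the closing quote) or as \ followed by '' (a plain backslash, then a doubled quote). Here is how that plays out.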
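Without the 'cd' line, the lexer matches the string you expect, because it is a valid string for your grammar; the highlighted characters are matched as \' followed by the closing '. But since the lexer is greedy, it will use this ambiguity to match more input whenever it can, for example when another unescaped ' appears somewhere later. With 'cd' on the next line, it reads the highlighted characters as \ followed by '' and keeps going until it reaches the opening quote of 'cd', which it takes as the closing quote of the first string.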
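To make the longest-match effect concrete, here is a small Python sketch. It is not ANTLR and not how ANTLR is implemented; it is a toy model that tries every way of reading the string body and keeps the longest token, which is the behaviour described above. The function name longest_string_token and the regex transcription of the three alternatives are my own, made up for illustration.

```python
import re

def longest_string_token(text, alternatives):
    """Return the longest prefix of `text` that forms a quoted string, or None.

    `alternatives` are regexes for the pieces allowed inside the quotes.
    Every way of combining them is explored, then the longest token wins.
    """
    if not text.startswith("'"):
        return None
    patterns = [re.compile(a) for a in alternatives]
    reachable = {1}      # body positions reachable after the opening quote
    frontier = [1]
    while frontier:
        pos = frontier.pop()
        for pattern in patterns:
            m = pattern.match(text, pos)
            if m and m.end() not in reachable:
                reachable.add(m.end())
                frontier.append(m.end())
    # Any reachable position holding a quote can serve as the closing quote.
    ends = [p + 1 for p in reachable if text.startswith("'", p)]
    return text[:max(ends)] if ends else None

# Transcription of the original rule: '' | \' | any character except '
ORIGINAL = [r"''", r"\\'", r"[^']"]

# Input copied from the question, backslashes exactly as shown there.
first_line = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';"

alone = longest_string_token(first_line, ORIGINAL)
with_cd = longest_string_token(first_line + "\n'cd';", ORIGINAL)
print(repr(alone))
print(repr(with_cd))
```

The first call stops at the quote just before the ; as intended. The second runs on through the ; and the newline and ends at the opening quote of 'cd', which is why the rest of the input no longer lexes.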
This ambiguity arises because a backslash can be either an ordinary character or an escape character. The common way to remove it is to add a rule for escaping the backslash itself (\\) and to stop matching a backslash as an ordinary character.
Alternatively, you can resolve the ambiguity in a different way. If you want \' to take priority over '', you can write:
L_S_STRING : '\'' ( ('\'\'') | ('\\'+ ~'\\') | ~('\'' | '\\') )* '\'' ;
It will work for your input.
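As a quick sanity check outside ANTLR, the rule can be transcribed into a Python regular expression and run over the two-line input from the question. This is only an approximation: Python's re module backtracks rather than picking the longest match, and the tokenize helper and the token names STRING, SEMI and WS are made up for this sketch. For this input the two approaches agree, because a backslash can no longer be consumed on its own.

```python
import re

# Regex transcription of the revised rule: '' | \+ followed by a non-backslash
# | any character except ' and \
STRING = r"'(?:''|\\+[^\\]|[^'\\])*'"
TOKEN = re.compile(rf"(?P<STRING>{STRING})|(?P<SEMI>;)|(?P<WS>[ \t\r\n]+)")

def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace like L_WS does."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"token recognition error at index {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# Input copied from the question, backslashes exactly as shown there.
source = r"'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\'';" + "\n'cd';"
for kind, lexeme in tokenize(source):
    print(kind, lexeme)
```

This yields four tokens: the complete first string, a ;, the string 'cd', and the final ;.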
By the way, you can shorten your code for L_WS:
L_WS : [ \t\n\r]+ -> skip ;