ANTLR lexer rule consumes too much

Posted 2019-08-14 06:11

Question:

ANTLR Lexer Rule Design

I have a requirement for the following token:

  • Allowable characters include uppercase, lowercase, numeric, space, and hyphen characters
  • Variable length (must be at least two characters long)
  • Token must contain at least one space or hyphen
  • Token must start and end with an uppercase, lowercase, numeric, or hyphen character (it cannot begin or end with a space)
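
To make the four requirements above concrete, here is a minimal Python sketch that checks a candidate string against them with a regular expression. It is only an illustration of the token definition, not ANTLR code; the names `TOKEN_RE` and `is_valid_token` are hypothetical and do not appear in the grammar.

```python
import re

# Hypothetical pattern mirroring the requirements listed above.
TOKEN_RE = re.compile(
    r"(?=.*[ -])"        # must contain at least one space or hyphen
    r"[A-Za-z0-9-]"      # first character: letter, digit, or hyphen (no space)
    r"[A-Za-z0-9 -]*"    # middle: any allowed character, including spaces
    r"[A-Za-z0-9-]"      # last character: letter, digit, or hyphen (no space)
)


def is_valid_token(text: str) -> bool:
    """Return True if the whole string satisfies the token requirements."""
    return TOKEN_RE.fullmatch(text) is not None
```

Because the pattern requires a distinct first and last character, the two-character minimum is enforced implicitly.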

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"

The following input fails to parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"

The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space and the left square bracket after "WATER TRANSPORTATION" before the lexer realizes that there is no match because it went too far.
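
The behavior I am after can be sketched outside ANTLR. The Python snippet below is an assumption-laden illustration, not a lexer implementation: it relies on the regex engine's backtracking to give back a trailing space, so the token stops at the last non-space character. The names `CANDIDATE_RE` and `match_token` are hypothetical.

```python
import re
from typing import Optional

# Starts and ends with a non-space allowed character; spaces only in between.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9-](?:[A-Za-z0-9 -]*[A-Za-z0-9-])?")


def match_token(text: str, pos: int = 0) -> Optional[str]:
    """Return the longest valid token starting at pos, or None."""
    m = CANDIDATE_RE.match(text, pos)
    if m is None:
        return None
    token = m.group()
    # Enforce the remaining requirements: length >= 2 and a space or hyphen.
    if len(token) < 2 or (" " not in token and "-" not in token):
        return None
    return token
```

With this sketch, both `match_token("WATER TRANSPORTATION[4400]")` and `match_token("WATER TRANSPORTATION [4400]")` yield the same token, which is the result I would like the lexer rule to produce for the two inputs.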

I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.

grammar T;

sic: SICSpecifier AlphaNumericSpaceHyphen  LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET  
:   '[';  

RIGHTBRACKET 
:   ']';

SICSpecifier: 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ 
{   
  $channel = HIDDEN;  
};  

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT : '0'..'9';
Digits: DIGIT+;

AlphaNumericSpaceHyphen 
:           (UCASEALPHA|LCASEALPHA |DIGIT|'-')+  (' ' (UCASEALPHA|LCASEALPHA |DIGIT|'-')+)+   
        |   (UCASEALPHA|LCASEALPHA |DIGIT)+ ('-')+  ((' '|UCASEALPHA|LCASEALPHA |DIGIT|'-')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
        |   ('-')+ (UCASEALPHA|LCASEALPHA |DIGIT)+  ((UCASEALPHA|LCASEALPHA |DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?   
        ;

Answer 1:

Unfortunately, there is no backtracking in lexer rules. Take a look at:

ANTLR lexer rule consumes characters even if not matched?

You can try adapting your grammar so that it changes the type of the token, as suggested in that solution.

Hope this helps.