ANTLR Lexer Rule Design
I have a requirement for the following token:
- Allowable characters include uppercase, lowercase, numeric, space, and hyphen characters
- Variable length (must be at least two characters long)
- Token must contain at least one space or hyphen
- Token must start and end with an uppercase, lowercase, numeric, or hyphen character (it cannot begin or end with a space)
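To make the requirements above concrete, here is a minimal sketch of them as a Python regular expression. This is not ANTLR and is not part of my grammar; the names `TOKEN_RE` and `is_valid_token` are made up for illustration, and the pattern follows the list literally (for example, it assumes consecutive spaces are allowed, since the list does not forbid them).

```python
import re

# Hypothetical helper names; the pattern encodes the four requirements above.
TOKEN_RE = re.compile(
    r"(?! )"              # must not begin with a space
    r"(?=.*[ -])"         # must contain at least one space or hyphen
    r"[A-Za-z0-9 -]{2,}"  # allowed characters only, at least two of them
    r"(?<! )"             # must not end with a space
)

def is_valid_token(text):
    """Return True if the whole string satisfies the token requirements."""
    return TOKEN_RE.fullmatch(text) is not None

for candidate in ["WATER TRANSPORTATION", "A-", "WATER", "WATER ", "WATER ["]:
    print(repr(candidate), is_valid_token(candidate))
```

Only the first two candidates satisfy the requirements; the others lack a space or hyphen, end with a space, or contain a disallowed character.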
The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):
"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"
The following input fails to parse (without quotes):
"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"
The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space and the left square bracket after "WATER TRANSPORTATION" before the lexer realizes that there is no match because it went too far.
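For comparison, this is the token boundary I am after, sketched with Python's backtracking regex engine rather than ANTLR. The pattern mirrors only the first alternative of "AlphaNumericSpaceHyphen", and `WORD`, `FIRST_ALT`, and `leading_token` are hypothetical names used just for this illustration. A backtracking engine gives the trailing space back when no further word follows it, which is the behavior I would like from the lexer.

```python
import re

# One run of letters, digits, or hyphens (assumed equivalent to the
# (UCASEALPHA|LCASEALPHA|DIGIT|'-')+ sub-rule in the grammar).
WORD = r"[A-Za-z0-9-]+"

# Mirrors the first alternative only: a word followed by one or more " word" groups.
FIRST_ALT = re.compile(rf"{WORD}(?: {WORD})+")

def leading_token(text):
    """Return the longest prefix of text matched by FIRST_ALT, or None."""
    match = FIRST_ALT.match(text)
    return match.group(0) if match else None

# In both cases the token stops after "TRANSPORTATION"; the space before
# the bracket is not consumed because no word follows it.
print(leading_token("WATER TRANSPORTATION[4400]"))
print(leading_token("WATER TRANSPORTATION [4400]"))
```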
I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.
grammar T;

sic : SICSpecifier AlphaNumericSpaceHyphen LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET  : '[';
RIGHTBRACKET : ']';

SICSpecifier : 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ { $channel = HIDDEN; };

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT      : '0'..'9';

Digits : DIGIT+;

AlphaNumericSpaceHyphen
    : (UCASEALPHA|LCASEALPHA|DIGIT|'-')+ (' ' (UCASEALPHA|LCASEALPHA|DIGIT|'-')+)+
    | (UCASEALPHA|LCASEALPHA|DIGIT)+ ('-')+ ((' '|UCASEALPHA|LCASEALPHA|DIGIT|'-')* (UCASEALPHA|LCASEALPHA|DIGIT|'-'))?
    | ('-')+ (UCASEALPHA|LCASEALPHA|DIGIT)+ ((UCASEALPHA|LCASEALPHA|DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA|DIGIT|'-'))?
    ;