I want to achieve the following behavior: User:class should be parsed to Object - User; Type - class, and Us:er:class should result in Object - Us:er; Type - class. I can't make the second part work: as soon as I add : as a legal symbol for WORD, it parses the whole input as a single object, Object - Us:er:class.
My grammar:
grammar Sketch;
/*
* Parser Rules
*/
input : (object)+ EOF ;
object : objectName objectType? NEWLINE ;
objectType : ':' TYPE ;
objectName : WORD ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment NUMBER : [0-9] ;
fragment WHITESPACE : (' ') ;
fragment SYMBOLS : [!-/:-@[-`] ;
fragment C : [cC] ;
fragment L : [lL] ;
fragment A : [aA] ;
fragment S : [sS] ;
fragment T : [tT] ;
fragment U : [uU] ;
fragment R : [rR] ;
TYPE : ((C L A S S) | (S T R U C T));
NEWLINE : ('\r'? '\n' | '\r')+ ;
WORD : (LOWERCASE | UPPERCASE | NUMBER | WHITESPACE | SYMBOLS)+ ;
The fragments for each letter are there for case-insensitive parsing. As I understand it, the lexer gives priority to rules top-to-bottom, so TYPE should be picked over WORD, but I can't get that to work. I'm new to ANTLR4, so maybe I'm missing something obvious.
If you just need to parse something this simple, you do not need to write a parser with ANTLR. It is one of the very few cases where I would suggest just using a simple regex. If you want to solve it with ANTLR anyway, I see two options: 1) The ugly solution: use predicates or actions to trick and force the parser into doing what you want. 2) Define just two tokens, one for identifiers and one for the colon, and do the composition later, in the code that uses your parser.
For example, for User:class you would get [[ID:"User"], [ID:"class"]], while for Us:er:class you would get [[ID:"Us"], [ID:"er"], [ID:"class"]]. Then, in your code, you know that the last ID represents the type and the sequence of all the other IDs represents the object.
Neither is a great solution, but I think ANTLR is not the right tool for what you are trying to do.
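To make both suggestions concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the names OBJECT_RE, parse_object and compose are hypothetical, the only types are class and struct (as in the TYPE rule of the question), and the IDs are assumed to arrive as a plain list of strings rather than as real ANTLR tokens.

```python
import re

# Hypothetical pattern: the name is everything before an optional trailing
# ":class" or ":struct", matched case-insensitively.
OBJECT_RE = re.compile(
    r"^(?P<name>.+?)(?::(?P<type>class|struct))?$", re.IGNORECASE
)


def parse_object(line):
    """Regex variant: split one input line into (name, type or None)."""
    match = OBJECT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse {line!r}")
    return match.group("name"), match.group("type")


def compose(ids):
    """Token variant: the last ID is the type, the rest form the name.

    Assumes a type is always present; a lone ID would be taken as the type.
    """
    *name_parts, type_name = ids
    return ":".join(name_parts), type_name


print(parse_object("User:class"))
print(parse_object("Us:er:class"))
print(compose(["Us", "er", "class"]))
```

The regex variant treats a line without a recognized type suffix as a bare object name, which mirrors the optional objectType in the grammar. The token variant cannot make that distinction on its own, so it would need an extra check against the known type names.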