I'm trying to use ANTLR to make a very simple parser that basically tokenizes a series of "."-delimited identifiers.
I've made a simple grammar:
r : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;
WS : [ \t\r\n]+ -> skip ;
When the parser is generated, I end up with a single terminal node that represents the whole string, instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?
As an example:
"." would yield one terminal node whose text is ".".
".foobar" would yield two nodes: a parent with text "." and a child with text "foobar".
".foobar.baz" would yield four nodes: a parent with text ".", a child with text "foobar", a second-level child with text ".", and a third-level child with text "baz".
Rules starting with a capital letter are Lexer rules.
With the following input file t.text
your grammar (in file Question.g4) produces the following tokens
The lexer (parser) is greedy. It tries to read as many input characters (tokens) as it can with the rule. The lexer rule
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)?
can read a dot, an ID, and then again a dot and an ID (because of the optional marker ? on the recursive reference), until the newline. That's why each line ends up in a single token.
When compiling the grammar, the error comes because the repetition marker of ID is * (which means zero or more times) instead of + (one or more times), so ID can match the empty string.
Now try this grammar:
and
$ grun Question r -gui t.text
displays the hierarchical tree structure you are expecting.
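As a minimal sketch of a grammar with that shape, assuming the recursion is moved out of the lexer and into a parser rule: the names structure_selector and DOT below are illustrative, not taken from the original post, and the start rule assumes the input holds a single selector chain (for example ".foobar.baz" on one line).

```antlr
grammar Question;

// Parser rules start with a lowercase letter. Putting the recursion here
// means every nesting level becomes its own node in the parse tree.
r : structure_selector EOF ;

structure_selector
    : DOT ( ID structure_selector? )?
    ;

// Lexer rules start with a capital letter. Each dot and each identifier
// is now a separate token instead of one greedy token per line.
DOT : '.' ;
ID  : [_a-z0-9$]+ ;          // + rather than *, so ID cannot match the empty string
WS  : [ \t\r\n]+ -> skip ;
```

With this sketch, ".foobar.baz" should lex as four tokens (DOT, ID, DOT, ID), and the parser nests the second DOT ID pair inside the first structure_selector, which is the parent/child arrangement described in the question.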