I'm about to make a parser for a fictional subset of XML with the following EBNF grammar:
DOCUMENT ::= ELEMENT
ELEMENT ::= START_TAG (ELEMENT | DATA)* END_TAG | EMPTY_TAG
START_TAG ::= < NAME ATTRIBUTE* >
END_TAG ::= </ NAME >
EMPTY_TAG ::= < NAME ATTRIBUTE* />
ATTRIBUTE ::= NAME = STRING
The above is the grammar 'as is', without any changes. Here is my attempt at converting it to LL(1):
DOCUMENT ::= ELEMENT EOF
ELEMENT ::= PREFIX > ELEMENT_OR_DATA END_TAG
| PREFIX />
PREFIX ::= < NAME OPT_ATTR
ELEMENT_OR_DATA ::= OPT_ELEMENT ELEMENT_OR_DATA
| OPT_DATA ELEMENT_OR_DATA
| epsilon
OPT_ELEMENT ::= ELM_LIST | epsilon
ELM_LIST ::= ELEMENT | ELEMENT ELM_LIST
OPT_DATA ::= DATA_LIST | epsilon
DATA_LIST ::= DATA | DATA DATA_LIST
END_TAG ::= </ NAME >
OPT_ATTR ::= ATTR_LIST | epsilon
ATTR_LIST ::= ATTRIBUTE | ATTRIBUTE ATTR_LIST
ATTRIBUTE ::= NAME = STRING
EOF ::= &$
Is this an LL(1) version of the original? If not, where did I go wrong? And if so, is there any way to 'simplify' without changing the meaning? I'm not convinced I have the simplest possible version.
Hope this is clear.
LL(1) parser cannot pick correct rule for ELEMENT's two rules just by looking at the next token. According to the grammar, the parser should try the first rule:
ELEMENT ::= PREFIX > ELEMENT_OR_DATA END_TAG
And if it does not work, it must return from the recursion (backtracking) in order to try the second rule:ELEMENT ::= PREFIX />
The problem is that both rules start from the same "object" PREFIX. In this case, it is even "worse" because, it is not terminal.
Definitely, this is not LL(1) grammar. Let's try to build one.
We first simplify the original grammar, by removing TAG's:
DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* > (ELEMENT | DATA)* </ NAME > ELEMENT ::= < NAME ATTRIBUTE* /> ATTRIBUTE ::= NAME = STRING
The next step is to split rules for ELEMENT in order to get the first token, which will help to the parser to select correct rule.
DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* ELEMENT1 ELEMENT1 ::= > (ELEMENT | DATA)* </ NAME > ELEMENT1 ::= /> ATTRIBUTE ::= NAME = STRING
Now parser can successfully start parsing an element. It "postpones" the decision whether it is extended or short element, and delegates this question to the ELEMENT1-rules. The later one can determine which type of element is being parsed by checking whether the next token is
>
or/>
.Let's continue the transformation:
DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= ELEMENT ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon
We just replaced *-operator with proper LL syntax. This last grammar still has some issue: the first two ELEM_OR_DATA rules may "confuse" the parser, because it cannot guess which one to apply (similar problem to one we discussed at the beginning).
Let's solve this issue by giving a hint to the parser:
DOCUMENT ::= ELEMENT EOF ELEMENT ::= < ELEMENT0 ELEMENT0 ::= NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= < ELEMENT0 ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon
We splitted ELEMENT1 and used ELEMENT0 in the first ELEM_OR_DATA rule. Now assuming DATA is a token, parser can easily determine which rule to apply by looking at the next token only.