Possible duplicate for this question however for me it's not specific enough.
The python grammar is claimed to be LL(1), but I've noticed some expressions in the Python grammar that really confuse me, for example, the arguments in the following function call:
foo(a)
foo(a=a)
corresponds to the following grammar:
argument: ( test [comp_for] |
test '=' test |
'**' test |
'*' test )
test
appears twice in the first position of the grammar. It means that by only looking at test
Python cannot determine it's test [comp_for]
or test '=' test
.
More examples:
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
Note 'is'
and 'is' 'not'
subscript: test | [test] ':' [test] [sliceop]
test
also appears twice.
Is my understanding of LL(1) wrong? Does Python do some workaround for the grammar during lexing or parsing to make it LL(1) processable? Thank you all in advance.
The grammar presented in the Python documentation (and used to generate the Python parser) is written in a form of Extended BNF which includes "operators" such as optionality (
[a]
) and Kleene closure ((a b c)*
). LL(1), however, is a category which appies only to simple context-free grammars, which do not have such operators. So asking whether that particular grammar is LL(1) or not is a category error.In order to make the question meaningful, the grammar would have to be transformed into a simple context-free grammar. This is, of course, possible but there is no canonical transformation and the Python documentation does not explain the precise transformation used. Some transformations may produce LL(1) grammars and other ones might not. (Indeed, naive translation of the Kleene star can easily lead to ambiguity, which is by definition not LL(k) for any k.)
In practice, the Python parsing apparatus transforms the grammar into an executable parser, not into a context-free grammar. For Python's pragmatic purposes, it is sufficient to be able to build a predictive parser with a lookahead of just one token. Because a predictive parser can use control structures like conditional statements and loops, a complete transformation into a context-free grammar is unnecessary. Thus, it is possible to use EBNF productions -- as with the documented grammar -- which are not fully left-factored, and even EBNF productions whose transformation to LL(1) is non-trivial:
In the above production, the repetition of
(';' small_stmt)*
may be followed by a';'
, which means that a simplewhile
loop will not correctly represent the production. I don't know how this production is handled by the Python parser generator, but it is possible to transform it into CFG by left-factoring after expanding the repetition:Similarly, the entire EBNF can be transformed into an LL(1) grammar. That is not done because the exercise is neither useful for parsing or for explaining the syntax. It would be hard to read, and the EBNF can be directly transformed into a parser.
This is slightly independent of the question of whether Python is LL(1), because a language is LL(1) precisely if an LL(1) grammar exists for the language. There will always be an infinitude of possible grammars for a language, including grammars which are not LL(k) for any k and even grammars which are not context-free, but that is irrelevant to the question of whether the language is LL(1): the language is LL(1) if even one LL(1) grammar exists. (I'm aware that this is not the original question, so I won't pursue this any further.)
You're correct that constructs like
'is' | 'is' 'not'
aren't LL(1). They can be left-factored to LL(1) quite easily by changing it to'is' notOpt
wherenotOpt: 'not' | ϵ
or, if you allow EBNF syntax, just'is' 'not'?
(or'is' ['not']
depending on the flavor of EBNF).So the language is LL(1), but the grammar technically is not. I assume the Python designers decided that this was okay because the left-factored version would be more difficult to read without much benefit and the current version can still be used as the basis for an LL(1) parser without much difficulty.