AFAIK the technique for lexing Python source code is:
- When current line's indentation level is less than previous line's, produce DEDENT. Produce multiple DEDENTs if it is closing multiple INDENTs.
- When end of input is reached, produce DEDENT(s) if there's unclosed INDENT(s).
Now, using PLY:
- How do I return multiple tokens from a t_definition?
- How do I make a t_definition that's called when EOF is reached? Simple `\Z` doesn't work -- PLY complains that it matches empty string.
As far as I know, PLY does not implement a push parser interface, which is how you would most easily solve this problem with bison. However, it is very easy to inject your own lexer wrapper, which can handle the queue of dedent tokens.
A minimal lexer implementation needs to implement a `token()` method which returns an object with `type` and `value` attributes. (You also need if your parser uses it, but I'm not going to worry about that here.)
Now, let's suppose that the underlying (PLY-generated) lexer produces `NEWLINE` tokens whose value is the length of leading whitespace following the newline. If some lines don't participate in the INDENT/DEDENT algorithm, the `NEWLINE` should be suppressed for those lines; we don't consider that case here. A simplistic example lexer function (which only works with spaces, not tabs) might be:
Now we wrap the PLY-generated lexer with a wrapper which deals with indents:
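A minimal sketch of such a wrapper, under some loudly stated assumptions: `IndentLexer`, `StubLexer`, `make_token` and `token_types` are names made up for this illustration, and `StubLexer` is only a stand-in for the PLY-generated lexer so that the example runs without PLY installed. It mimics the behaviour described above (a `NEWLINE` token whose value is the number of spaces after the newline, and `None` from `token()` at end of input). Blank lines and tabs are not handled.

```python
import re
from collections import deque
from types import SimpleNamespace


def make_token(type_, value):
    # Hypothetical helper: any object with .type and .value attributes will do.
    return SimpleNamespace(type=type_, value=value)


class StubLexer:
    """Stand-in for the PLY-generated lexer (an assumption of this sketch).

    Emits NAME tokens, plus NEWLINE tokens whose value is the number of
    spaces following the newline. Spaces only, no tabs.
    """
    pattern = re.compile(r'(?P<NEWLINE>\n[ ]*)|(?P<NAME>\w+)|(?P<SKIP>[ ]+)')

    def __init__(self, text):
        self.matches = self.pattern.finditer(text)

    def token(self):
        for m in self.matches:
            if m.lastgroup == 'NEWLINE':
                return make_token('NEWLINE', len(m.group()) - 1)
            if m.lastgroup == 'NAME':
                return make_token('NAME', m.group())
            # SKIP: whitespace inside a line, ignored
        return None  # end of input


class IndentLexer:
    """Wraps any lexer with a token() method and adds INDENT/DEDENT tokens."""

    def __init__(self, lexer):
        self.lexer = lexer
        self.indents = [0]      # stack of currently open indentation widths
        self.pending = deque()  # tokens waiting to be handed out

    def token(self):
        if not self.pending:
            self._fill()
        return self.pending.popleft() if self.pending else None

    def _fill(self):
        tok = self.lexer.token()
        if tok is None:
            # End of input: close every INDENT that is still open.
            while len(self.indents) > 1:
                self.indents.pop()
                self.pending.append(make_token('DEDENT', None))
            return
        self.pending.append(tok)
        if tok.type != 'NEWLINE':
            return
        width = tok.value
        if width > self.indents[-1]:
            self.indents.append(width)
            self.pending.append(make_token('INDENT', width))
        else:
            # One DEDENT per indentation level being closed.
            while width < self.indents[-1]:
                self.indents.pop()
                self.pending.append(make_token('DEDENT', width))
            if width != self.indents[-1]:
                raise IndentationError('dedent does not match any outer level')


def token_types(text):
    # Convenience for the demo: collect token types until token() returns None.
    lexer = IndentLexer(StubLexer(text))
    return [tok.type for tok in iter(lexer.token, None)]


print(token_types("a\n  b\n    c\nd"))
```

The queue is what answers both questions: a single `NEWLINE` from the underlying lexer can turn into several tokens spread over successive `token()` calls, and end of input is detected when the underlying `token()` returns `None`, so no `\Z` rule is needed. With a real PLY lexer you would presumably feed it with `lexer.input(data)` first and then hand the wrapper to the parser, e.g. `parser.parse(lexer=IndentLexer(lexer))`, with `INDENT` and `DEDENT` added to the grammar's token list.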