AFAIK the technique for lexing Python source code is:
- When the current line's indentation level is less than the previous line's, produce a DEDENT. Produce multiple DEDENTs if it closes multiple INDENTs.
- When the end of input is reached, produce a DEDENT for each unclosed INDENT (see the example after this list).
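For example, given

if x:
    y = 1

I'd expect roughly this token stream (the non-indentation token names are just illustrative):

NAME NAME COLON NEWLINE
INDENT NAME EQUALS NUMBER NEWLINE
DEDENT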
Now, using PLY:
- How do I return multiple tokens from a t_definition?
- How do I make a t_definition that's called when EOF is reached? A simple \Z doesn't work -- PLY complains that it matches the empty string.
As far as I know, PLY does not implement a push parser interface, which is how you would most easily solve this problem with bison. However, it is very easy to inject your own lexer wrapper, which can handle the queue of dedent tokens.
A minimal lexer implementation only needs to provide a token() method which returns an object with type and value attributes. (You also need a lineno attribute if your parser uses it, but I'm not going to worry about that here.)
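Any object with that shape will do. As a minimal sketch of the contract (the names here are purely illustrative), a lexer that just replays a canned list of tokens already satisfies it:

from collections import namedtuple

Tok = namedtuple('Tok', 'type value')

class ListLexer(object):
    """Replays a fixed list of (type, value) pairs via token()."""
    def __init__(self, pairs):
        self._toks = iter(Tok(t, v) for t, v in pairs)
    def token(self):
        # Returning None signals end of input to the parser.
        return next(self._toks, None)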
Now, let's suppose that the underlying (PLY-generated) lexer produces NEWLINE tokens whose value is the length of the leading whitespace following the newline. If some lines don't participate in the INDENT/DEDENT algorithm, the NEWLINE should be suppressed for those lines; we don't consider that case here. A simplistic example lexer function (which only works with spaces, not tabs) might be:
# This function doesn't handle tabs. Beware!
def t_NEWLINE(self, t):
    r'\n(?:\s*(?:[#].*)?\n)*\s*'
    # Skip over blank and comment-only lines; the value becomes the
    # width of the indentation on the next non-blank line.
    t.value = len(t.value) - 1 - t.value.rfind('\n')
    return t
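Note that INDENT and DEDENT are only ever fabricated by the wrapper below, but they (and NEWLINE) still have to appear in the lexer's tokens declaration so that yacc knows about them. Something like this, where the other names are just placeholders:

# INDENT and DEDENT have no t_ rule of their own; only the wrapper
# produces them, but yacc still needs them declared as terminals.
tokens = ('NEWLINE', 'INDENT', 'DEDENT', 'NAME', 'NUMBER')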
Now we wrap the PLY-generated lexer with a wrapper which deals with indents:
# WARNING:
# This code hasn't been tested much and it may also be inefficient
# and/or inexact. It doesn't do python-style tab handling. Etc. etc.

from collections import namedtuple, deque

# These are the tokens. We only generate one of each here. If
# we used lineno or didn't trust the parser not to mess with the
# token, we could generate a new one each time.
IndentToken = namedtuple('Token', 'type value')
dedent  = IndentToken('DEDENT', None)
indent  = IndentToken('INDENT', None)
newline = IndentToken('NEWLINE', None)

class IndentWrapper(object):

    def __init__(self, lexer):
        """Create a new wrapper given the lexer which is being wrapped"""
        self.lexer = lexer
        self.indent_stack = [0]
        # A queue is overkill for this case, but it's simple.
        self.token_queue = deque()
        # This is just in case the ply-generated lexer cannot be called again
        # after it returns None.
        self.eof_reached = False

    def token(self):
        """Return the next token, or None if end of input has been reached"""
        # Do we have any queued tokens?
        if self.token_queue:
            return self.token_queue.popleft()
        # Are we done?
        if self.eof_reached:
            return None
        # Get a token
        t = self.lexer.token()
        if t is None:
            # At end of input, we might need to send some dedents
            self.eof_reached = True
            if len(self.indent_stack) > 1:
                # Queue one DEDENT for each indent level still open,
                # and return the first of them right away.
                for _ in range(len(self.indent_stack) - 1):
                    self.token_queue.append(dedent)
                self.indent_stack = [0]
                t = self.token_queue.popleft()
        elif t.type == "NEWLINE":
            # The NEWLINE token includes the amount of leading whitespace.
            # Fabricate indents or dedents as/if necessary and queue them.
            if t.value > self.indent_stack[-1]:
                self.indent_stack.append(t.value)
                self.token_queue.append(indent)
            else:
                while t.value < self.indent_stack[-1]:
                    self.indent_stack.pop()
                    self.token_queue.append(dedent)
                if t.value != self.indent_stack[-1]:
                    raise IndentationError  # Or however you indicate errors
        return t
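To hook it up, build the PLY lexer as usual, wrap it, and hand the wrapper to the parser. A sketch, assuming the t_* rules above live in a hypothetical class called MyLexer and the grammar (p_* rules) lives in the calling module:

import ply.lex as lex
import ply.yacc as yacc

mylexer = MyLexer()              # hypothetical class holding tokens and the t_* rules
lexer = lex.lex(object=mylexer)  # the underlying PLY-generated lexer
wrapper = IndentWrapper(lexer)

parser = yacc.yacc()             # grammar rules assumed to be in this module

# The wrapper only implements token(), not input(), so feed the source
# text to the underlying PLY lexer and pass the wrapper to the parser.
lexer.input(source_code)         # source_code: your program text
result = parser.parse(lexer=wrapper)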