How would you parse indentation (Python style)?

Posted 2019-02-03 19:05

How would you define your parser and lexer rules to parse a language that uses indentation to define scope?

I have already googled and found a clever approach for parsing it by generating INDENT and DEDENT tokens in the lexer.

I will go deeper on this problem and post an answer if I come to something interesting, but I would like to see other approaches to the problem.

EDIT: As Charlie pointed out, there is already another thread that is very similar, if not the same. Should my post be deleted?

2 Answers
手持菜刀，她持情操
#2 · 2019-02-03 19:40

This is somewhat hypothetical, as it depends on what technology you have for your lexer and parser, but the easiest way would seem to be to have BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. Using the "off-side rule", your lexer needs to keep track of a stack of indentation levels. When the indent level increases, emit a BEGINBLOCK for the parser; when the indentation level decreases, emit an ENDBLOCK and pop levels off the stack.
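A minimal sketch of that stack bookkeeping (illustrative Python; the `tokenize` helper and the tuple-style tokens are made up for this example, only the BEGINBLOCK/ENDBLOCK names come from the idea above):

```python
def tokenize(source):
    """Very rough off-side-rule lexer: emits BEGINBLOCK/ENDBLOCK around indented blocks."""
    indent_stack = [0]                          # stack of currently open indentation levels
    tokens = []
    for line in source.splitlines():
        if not line.strip():                    # blank lines don't affect indentation
            continue
        indent = len(line) - len(line.lstrip(" "))
        while indent < indent_stack[-1]:        # dedent: close blocks until levels match
            indent_stack.pop()
            tokens.append(("ENDBLOCK", None))
        if indent > indent_stack[-1]:           # indent: open a new block
            indent_stack.append(indent)
            tokens.append(("BEGINBLOCK", None))
        tokens.append(("LINE", line.strip()))   # placeholder for real per-line lexing
    while len(indent_stack) > 1:                # close any blocks still open at end of input
        indent_stack.pop()
        tokens.append(("ENDBLOCK", None))
    return tokens
```

Running it on `"if x:\n    y = 1\nz = 2"` would yield LINE, BEGINBLOCK, LINE, ENDBLOCK, LINE, which is exactly the brace-like structure the parser can consume.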

Here's another discussion of this on SO, btw.

Rolldiameter
#3 · 2019-02-03 19:40

You could also track in the lexer how many indent units precede the first token of each line and pass that to the parser. The most interesting part is passing it to the parser correctly :) If your parser uses lookahead (meaning it may query a variable number of tokens before it actually matches even one), trying to pass the count through a single global variable is a very bad idea, because the lexer can slip onto the next line and change the indent counter while the parser is still working on the previous line. Globals are evil in many other cases too ;)

Marking the first 'real' token of each line with the indent counter is more reasonable. I can't give you an exact example (I don't even know which parser and lexer generators you are going to use, if any), but something like storing the data on the first token of each line (which can be awkward if you can't easily get at that token from the parser), or keeping custom side data (a map linking tokens to indent levels, or an array indexed by source line number with the indent as the value), should be enough. One downside of this approach is the additional complexity in the parser, which now has to distinguish between indent values and change its behavior based on them. Something like LOOKAHEAD({ yourConditionInJava }) in JavaCC might work here, but it is NOT a very good idea. The extra tokens of your approach look like the lesser evil :)
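To make the side-table idea a bit more concrete, here is one possible shape for it (illustrative Python; the `indent_table` name and the line-number keying are assumptions, not tied to any particular lexer or parser generator):

```python
def indent_table(source):
    """Map 1-based line number -> indentation width of that line (blank lines skipped)."""
    table = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.strip():                                   # ignore blank lines
            table[lineno] = len(line) - len(line.lstrip(" "))
    return table
```

If every token carries the line number it came from, the parser can compare `table[tok.lineno]` against the indent of the previous significant line to decide whether a block was opened or closed, without relying on a shared mutable counter.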

As another alternative, I would suggest mixing these two approaches: generate the extra tokens only when the indent counter changes its value from one line to the next, like artificial BEGIN and END tokens. That way you lower the number of 'artificial' tokens in the stream the lexer feeds to the parser; only your parser grammar needs to be adjusted to understand the additional tokens.

I haven't tried this myself (I have no real experience with parsing such languages), I'm just sharing my thoughts on possible solutions. Looking at already-built parsers for these kinds of languages could be of great value to you. Open source is your friend ;)
