Where should I draw the line between lexer and parser?

Posted 2020-07-03 09:46

Question:

I'm writing a lexer for the IMAP protocol for educational purposes and I'm stumped as to where I should draw the line between lexer and parser. Take this example of an IMAP server response:

* FLAGS (\Answered \Deleted)

This response is defined in the formal syntax like this:

mailbox-data   = "FLAGS" SP flag-list
flag-list      = "(" [flag *(SP flag)] ")"
flag           = "\Answered" / "\Deleted"

Since they are specified as string literals (aka "terminal" tokens), would it be more correct for the lexer to emit a unique token for each, like:

(TknAnsweredFlag)
(TknSpace)
(TknDeletedFlag)

Or would it be just as correct to emit something like this:

(TknBackSlash)
(TknString "Answered")
(TknSpace)
(TknBackSlash)
(TknString "Deleted")

My confusion is that the former method could overcomplicate the lexer - if \Answered had two meanings in two different contexts, the lexer wouldn't emit the right token. As a contrived example (this situation won't occur because e-mail addresses are enclosed in quotes), how would the lexer deal with an e-mail address like \Answered@googlemail.com? Or is the formal syntax designed so that such an ambiguity can never arise?

Answer 1:

As a general rule, you don't want lexical syntax to propagate into the grammar, because it's just detail. For instance, a lexer for a programming language like C would certainly recognize numbers, but it is generally inappropriate to produce HEXNUMBER and DECIMALNUMBER tokens, because this isn't important to the grammar.
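To make that concrete (a hypothetical Python sketch, not tied to any particular lexer generator): both spellings collapse into a single NUMBER token, and only the value survives for the grammar to see.

import re

# Hexadecimal and decimal literals both become the same NUMBER token;
# the grammar never sees which spelling was used.
NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")

def lex_number(text, pos=0):
    m = NUMBER.match(text, pos)
    return None if m is None else (("NUMBER", int(m.group(), 0)), m.end())

lex_number("0x2A")   # -> (("NUMBER", 42), 4)
lex_number("42")     # -> (("NUMBER", 42), 2)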

I think what you want are the most abstract tokens that allow your grammar to distinguish the cases of interest for your purpose. You may have to mediate this, weighing the confusion a choice causes in one part of the grammar against the choices you might make in other parts.

If your goal is simply to read past the flag values, then in fact you don't need to distinguish among them, and a TknFlag with no associated content would be good enough.

If your goal is to process the flag values individually, you need to know whether you got an ANSWERED and/or DELETED indication. How they are lexically spelled is irrelevant, so I'd go with your TknAnsweredFlag solution. I would also dump the TknSpace: in any sequence of flags there must be intervening spaces (your spec says so), so I'd try to eliminate them using whatever whitespace-suppression machinery your lexer offers.
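A minimal sketch of that approach (hypothetical Python; the token names come from the question): each known flag spelling maps straight to its own token kind, and the separating spaces are swallowed rather than emitted as TknSpace.

# Hypothetical sketch: each flag spelling maps directly to its own token kind,
# and the spaces between flags are suppressed instead of emitted as TknSpace.
FLAG_TOKENS = {
    r"\Answered": "TknAnsweredFlag",
    r"\Deleted": "TknDeletedFlag",
}

def lex_flag_list(text):
    # text is the parenthesized flag list, e.g. "(\Answered \Deleted)"
    return [FLAG_TOKENS[word] for word in text.strip("()").split()]

lex_flag_list(r"(\Answered \Deleted)")   # -> ["TknAnsweredFlag", "TknDeletedFlag"]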

On occasion, I run into situations where there are dozens of such flag-like things, and your grammar starts to become cluttered if you have a token for each. If the grammar doesn't need to know the specific flags, then you should have a single TknFlag with an associated string value. If a small subset of the flags needs to be distinguished by the grammar but most of them do not, then you should compromise: have individual tokens for the flags that matter to the grammar, and a catch-all TknFlag with an associated string for the rest.
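One hypothetical way to express that compromise, in the same style as the sketch above: flags the grammar cares about keep their own token kinds, and everything else collapses into a generic TknFlag that carries its spelling as a value.

# Flags the grammar distinguishes get dedicated token kinds; the rest become
# a generic TknFlag whose value is the original spelling.
SPECIAL_FLAGS = {
    r"\Answered": "TknAnsweredFlag",
    r"\Deleted": "TknDeletedFlag",
}

def flag_token(word):
    return (SPECIAL_FLAGS[word], None) if word in SPECIAL_FLAGS else ("TknFlag", word)

flag_token(r"\Seen")       # -> ("TknFlag", "\\Seen")
flag_token(r"\Answered")   # -> ("TknAnsweredFlag", None)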

Regarding the difficulty of having two different interpretations: this is one of those tradeoffs. If you run into that issue, then your tokens need to carry fine enough detail in both places where they are used in the grammar so that you can discriminate. If "\" is relevant as a token somewhere else in the grammar, you certainly could produce both TknBackSlash and TknAnswered.

However, if the way something is treated in one part of the grammar differs from another, you can often get around this with a mode-driven lexer. Think of the modes as a finite state machine, each mode with an associated (sub)lexer. Transitions between modes are triggered by tokens that act as cues (you must have a FLAGS token; it is precisely such a cue that you are about to pick up flag values). Within a mode you can produce tokens that other modes would not produce; thus in one mode you might produce "\" tokens, but in your flag mode you wouldn't need to. Mode support is pretty common in lexers because this problem is more common than you might expect; see the Flex documentation on start conditions for an example.
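A rough sketch of the mode idea (hypothetical Python, not Flex syntax; the mode names and token kinds are made up for illustration): seeing FLAGS switches the lexer into a flag mode in which backslashed words become flag tokens, and the closing parenthesis switches it back.

# Hypothetical mode-driven lexer: the current mode decides how input is
# tokenized, and cue tokens trigger transitions between modes.
def lex(line):
    mode = "DEFAULT"
    for word in line.split():
        if mode == "FLAGS":
            closing = word.endswith(")")
            word = word.strip("()")
            if word.startswith("\\"):      # backslashed words are flags only in this mode
                yield ("TknFlag", word[1:])
            if closing:
                mode = "DEFAULT"           # the flag list is over; back to the normal rules
        else:
            yield ("TknWord", word)
            if word == "FLAGS":
                mode = "FLAGS"             # FLAGS is the cue that flag values follow

list(lex(r"* FLAGS (\Answered \Deleted)"))
# -> [("TknWord", "*"), ("TknWord", "FLAGS"), ("TknFlag", "Answered"), ("TknFlag", "Deleted")]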

The fact that you are asking the question shows you are on the right track for making a good choice. You need to balance the maintainability goal of minimizing tokens (technically you could parse using a token for every ASCII character!) against the fundamental requirement to discriminate well enough for your needs. After you've built a dozen grammars this tradeoff will seem easy, but until then I think the rules of thumb I've provided are pretty good.



Answer 2:

I'd come up with the CFG first; whatever terminals it needs to do its job are what the lexer should recognize. Otherwise you are just guessing at the proper way to tokenize the string.



Answer 3:

I'd recommend avoiding the separation of lexer and parser entirely: modern parsing approaches (like PEGs) allow you to mix lexing and parsing, so you won't need tokens at all.
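A tiny illustration of that style (a hypothetical hand-rolled scannerless parser in Python; a real PEG library would express the same grammar more declaratively): each "rule" consumes raw characters directly, so there is no token stream at all.

def parse_flags_response(text):
    # Scannerless sketch: rules match raw characters, no separate lexer.
    # The empty flag list allowed by the grammar is omitted for brevity.
    pos = 0

    def literal(s):                        # match an exact string
        nonlocal pos
        if not text.startswith(s, pos):
            raise ValueError(f"expected {s!r} at position {pos}")
        pos += len(s)

    def flag():                            # flag: a backslash followed by the flag name
        nonlocal pos
        literal("\\")
        start = pos
        while pos < len(text) and text[pos] not in " )":
            pos += 1
        return text[start:pos]

    literal("* FLAGS (")
    flags = [flag()]
    while pos < len(text) and text[pos] == " ":
        literal(" ")
        flags.append(flag())
    literal(")")
    return flags

parse_flags_response(r"* FLAGS (\Answered \Deleted)")   # -> ["Answered", "Deleted"]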