Combining lexer and parser in a parser combinator

Published 2019-02-18 05:43

I'm using uu-parsinglib, but I think the following question applies to parser combinators in general.

Let's consider the following example:

I've got a lexer with a combinator pLex, which produces a list of tokens (of type MyToken). I now want to write a parser, which will consume the tokens and build an AST.

What is the best way to connect the lexer and parser? Right now I have a lex function:

lex s = parse ( (,) <$> pLex <*> pEnd) (createStr (LineColPos 0 0 0) s)

Should I create a function parse p = ...? If so, how do I construct it so that it keeps track of the line and column information coming from the lexer? Or should I create a parser combinator which would use the pLex combinator somehow?

2 Answers
做个烂人
Answered 2019-02-18 05:53

I think there is nothing in uu-parsinglib which prevents you from using an input other than Text; it is just that for Text (and friends) we have provided quite a few functions you are likely to need. If you look at the older uulib parser combinators you will find a scanner-based approach, which can be used just as well with the newer uu-parsinglib.

If you want to process a lot of data, it may be better to have a separate scanning phase; error messages tend to be more informative that way. In uulib you will find some support for writing your own scanner. Most languages impose special restrictions or requirements on the lexical structure (e.g. the offside rule) which off-the-shelf scanner-generating tools may fail to handle, or need to be adapted for, so writing your own scanner is often the practical route.

Anthone
Answered 2019-02-18 06:08

Table-based parsers require separation of lexical analysis and parsing because of their limited lookahead capability. Looking ahead far enough to combine lexical analysis into the parser would explode the state space.

Combinator-based approaches do not usually suffer this problem, as they are typically doing recursive-descent parsing. Unless otherwise noted by the library author, there is no harm in combining the phases and not much to gain by separating them.
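To make the "combined phases" point concrete, here is a minimal scannerless sketch in Parsec (chosen only because its API is widely known; the lexeme/symbol names and the toy sum grammar are my own illustration, not from the question). Each token parser consumes its own trailing whitespace, so lexing and parsing happen in a single recursive-descent pass:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Scannerless style: each "token" parser eats trailing whitespace itself,
-- so no separate lexer phase is needed.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

integer :: Parser Int
integer = lexeme (read <$> many1 digit)

symbol :: String -> Parser String
symbol = lexeme . string

-- Parse "1 + 2 + 3" directly from characters and sum the numbers.
expr :: Parser Int
expr = sum <$> (integer `sepBy1` symbol "+") <* eof

main :: IO ()
main = print (parse expr "" "1 + 2 + 3")
-- prints: Right 6
```

The cost of this style is exactly the one mentioned above for large inputs: lexical and syntactic errors are reported by the same machinery, so messages can be less precise than with a dedicated scanner.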

Although uu-parsinglib provides the Str class to abstract over different string-like inputs, looking at its definition shows that it still assumes that you are ultimately reading a sequence of Char, whether they be from a String, ByteString, Text, etc. So trying to get it to parse a MyToken stream seems like it could be difficult. Parsec might be a better choice if you feel you need to do that.
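If you do go the Parsec route, it can consume any token stream: a list of your MyToken values already has a Stream instance, and tokenPrim matches individual tokens. A sketch under assumed names (the MyToken constructors and the grammar are invented for illustration; a real version would also thread the lexer's source positions through the position-update argument instead of leaving it unchanged):

```haskell
import Text.Parsec

-- Hypothetical token type produced by a separate lexing phase.
data MyToken = TNum Int | TPlus
  deriving (Show, Eq)

-- A parser over a list of tokens instead of over Char.
type TokParser = Parsec [MyToken] ()

-- Match one token via an extraction function; tokenPrim's second
-- argument should advance the SourcePos, here left as-is for brevity.
satisfyTok :: (MyToken -> Maybe a) -> TokParser a
satisfyTok f = tokenPrim show (\pos _ _ -> pos) f

number :: TokParser Int
number = satisfyTok $ \t -> case t of
  TNum n -> Just n
  _      -> Nothing

plus :: TokParser ()
plus = satisfyTok $ \t -> case t of
  TPlus -> Just ()
  _     -> Nothing

-- Sum of numbers separated by TPlus tokens.
expr :: TokParser Int
expr = (sum <$> (number `sepBy1` plus)) <* eof

main :: IO ()
main = print (parse expr "<tokens>" [TNum 1, TPlus, TNum 2, TPlus, TNum 3])
-- prints: Right 6
```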

As to your question about your string implementation: combinators take a string-like input containing syntactic structure and return the corresponding semantic value, if they match. Inside the combinator, you build that semantic value from what you parse, both by consuming items from the input stream directly and by combining the semantic values returned by the sub-combinators you call.

So, your 'String matching' combinator in your example will have a list of tokens in its scope thanks to the parsing it did. You can use the full power of Haskell to combine those tokens into a single MyString value in whatever way makes sense for your language: maybe a 'SplicedString' type that represents what values are to be spliced into it.

The string combinator was probably called by an 'expression' combinator, which will be able to combine the MyString value with other parsed values into a MyExpression value. It's combinators returning semantic values all the way back up!
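As a concrete sketch of "semantic values all the way up" (the Piece/MyString types and the ${...} splice syntax here are my assumptions, not from the question), a string combinator can build a spliced-string value that an enclosing expression combinator would then fold into its own result:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical semantic values for a string with splices.
data Piece = Lit String | Splice String
  deriving (Show, Eq)

newtype MyString = MyString [Piece]
  deriving (Show, Eq)

-- One piece of the string body: either a ${name} splice or literal text.
piece :: Parser Piece
piece = splice <|> lit
  where
    splice = Splice <$> (try (string "${") *> many1 letter <* char '}')
    lit    = Lit    <$> many1 (noneOf "$\"")

-- The whole combinator returns a single MyString semantic value,
-- ready to be consumed by an enclosing 'expression' combinator.
myString :: Parser MyString
myString = MyString <$> (char '"' *> many piece <* char '"')

main :: IO ()
main = print (parse myString "" "\"abc${x}def\"")
-- prints: Right (MyString [Lit "abc",Splice "x",Lit "def"])
```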
