What's the best way to write a parser by hand?

2019-01-31 04:22发布

We've used ANTLR to create a parser for a SQL-like grammar, and while the results are satisfactory in most cases, there are a few edge cases that we need to fix; and since we didn't write the parser ourselves we don't really understand it well enough to be able to make sensible changes.

So, we'd like to write our own parser. What's the best way to go about writing a parser by hand? What sort of parser should we use - recursive descent has been recommended; is that right? We'll be writing it in C#, so any tutorials for writing parsers in that language would be gratefully received.

UPDATE: I'd also be interested in answers that involve F# - I've been looking for a reason to use that in a project.

标签: c# .net parsing f#
16条回答
我只想做你的唯一
2楼-- · 2019-01-31 04:44

Recursive descent will give you the simplest way to go, but I would have to agree with mouviciel that flex and bison and definitely worth learning. When you find out you have a mistake in your grammar, fixing a definition of the language in flex /bison will be a hell of a lot easier then rewriting your recursive descent code.

FYI the C# parser is written recursive descent and it tends to be quite robust.

查看更多
做个烂人
3楼-- · 2019-01-31 04:45

The reason you don't want a table-driven parser is that you will not be able to create sensible error-messages. That is ok for a generated language, but not one where humans are involved. The error messages produced by c-like language compilers provide ample evidence that people can adapt to anything, no matter how bad.

查看更多
Emotional °昔
4楼-- · 2019-01-31 04:47

I suggest that you don't write the lexer by hand - use flex or similar. The task of recognising tokens is not that hard to do by hand, but I don't think you'd gain much.

As others have said, recursive descent parsers are easiest to write by hand. Otherwise you have to maintain the table of state transitions for each token, which isn't really human-readable.

I'm pretty sure ANTLR implements a recursive descent parser anyway: there's a mention of it in an interview about ANTLR 3.0.

I've also found a series of blog posts about writing a parser in C#. It seems quite gentle.

查看更多
孤傲高冷的网名
5楼-- · 2019-01-31 04:47

JFlex is a flex implementation for Java, and there is now a C# port of that project http://sourceforge.net/projects/csflex/. There also appears to be a C# port of CUP in progress, which can be found here: http://sourceforge.net/projects/datagraph/

I too would recommend avoiding hand crafting your own solution. I tried this once myself for a very simple language (part of a university project) and it was incredibly time consuming and difficult. It is also exceedingly hard to maintain and change once written.

Using an existing parser generator is the way to go, as the bulk of the hard work has been done and has been well tested over the years.

查看更多
再贱就再见
6楼-- · 2019-01-31 04:48

Well if you don't mind using another compiler compiler tool like ANTLR I suggest you take a look at Coco/R

I've used it in the past and it was pretty good...

查看更多
Lonely孤独者°
7楼-- · 2019-01-31 04:49

I would highly recommend the F# language as your language of choice for parsing on the .NET Platform. It's roots in the ML family of languages means it has excellent support for language-oriented programming.

Discriminated unions and pattern-matching allow for a very succinct and powerful specification of your AST. Higher-order functions allow for definition of parse operations and their composition. First-class support for monadic types allows for state management to be handled implicitly greatly simplifying the composition of parsers. Powerful type-inference greatly aides the definition of these (complex) types. And all of this can be specified and executed interactively allowing you to rapidly prototype.

Stephan Tolksdorf has put this into practice with his parser combinator library FParsec

From his examples we see how naturally an AST is specified:

type expr =
    | Val of string
    | Int of int
    | Float of float
    | Decr of expr

type stmt =
    | Assign of string * expr
    | While of expr * stmt
    | Seq of stmt list
    | IfThen of expr * stmt
    | IfThenElse of expr * stmt * stmt
    | Print of expr

type prog = Prog of stmt list

the implementation of the parser (partially elided) is just as succinct:

let stmt, stmtRef = createParserForwardedToRef()

let stmtList = sepBy1 stmt (ch ';')

let assign =
    pipe2 id (str ":=" >>. expr) (fun id e -> Assign(id, e))

let print = str "print" >>. expr |>> Print

let pwhile =
    pipe2 (str "while" >>. expr) (str "do" >>. stmt) (fun e s -> While(e, s))

let seq =
    str "begin" >>. stmtList .>> str "end" |>> Seq

let ifthen =
    pipe3 (str "if" >>. expr) (str "then" >>. stmt) (opt (str "else" >>. stmt))
          (fun e s1 optS2 ->
               match optS2 with
               | None    -> IfThen(e, s1)
               | Some s2 -> IfThenElse(e, s1, s2))

do stmtRef:= choice [ifthen; pwhile; seq; print; assign]


let prog =
    ws >>. stmtList .>> eof |>> Prog

On the second line, as an example, stmt and ch are parsers and sepBy1 is a monadic parser combinator that takes two simple parsers and returns a combination parser. In this case sepBy1 p sep returns a parser that parses one or more occurrences of p separated by sep. You can thus see how quickly a powerful parser can be combined from simple parsers. F#'s support for overridden operators also allow for concise infix notation e.g. the sequencing combinator and the choice combinator can be specified as >>. and <|>.

Best of luck,

Danny

查看更多
登录 后发表回答