What's the best way to write a parser by hand?

2019-01-31 04:22发布

We've used ANTLR to create a parser for a SQL-like grammar, and while the results are satisfactory in most cases, there are a few edge cases that we need to fix; and since we didn't write the parser ourselves we don't really understand it well enough to be able to make sensible changes.

So, we'd like to write our own parser. What's the best way to go about writing a parser by hand? What sort of parser should we use - recursive descent has been recommended; is that right? We'll be writing it in C#, so any tutorials for writing parsers in that language would be gratefully received.

UPDATE: I'd also be interested in answers that involve F# - I've been looking for a reason to use that in a project.

标签: c# .net parsing f#
16条回答
一夜七次
2楼-- · 2019-01-31 05:01

There is no "one best" way. Depending on your needs you may want bottom up (LALR1) or recursive descent(LLk). Articles such as this one give personal reasons for prefering LALR(1) (bottom up) to LL(k). However, each type of parser has its benefits and drawbacks. Generally LALR will be faster as a finite-state_machine is generated as a lookup table.

To pick what's right for you examine your situation; familiarize yourself with the tools and technologies. Starting with some LALR and LL Wikipedia articles isn't a bad choice. In both cases you should ALWAYS start with specifying the grammar in BNF or EBNF. I prefer EBNF for its succinctness.

Once you've gotten your mind wrapped around what you want to do, and how to represent it as a grammar, (BNF or EBNF) try a couple of different tools and run them across representative samples of text to be parsed.

Anecdotally:

However, I've heard that LL(k) is more flexible. I've never bothered to find out for myself. From my few parser building experiences I have noticed that regardles if it's LALR or LL(k) the best way to pick what's best for your needs is to start with writing the grammar. I've written my own C++ EBNF RD parser builder template library, used Lex/YACC and had coded a small R-D parser. This was spread over the better part of 15 years, and I spent no more than 2 months on the longest of the three projects.

查看更多
乱世女痞
3楼-- · 2019-01-31 05:03

In C/Unix, the traditional way is to use lex and yacc. With GNU, the equivalent tools are flex and bison. I don't know for Windows/C#.

查看更多
倾城 Initia
4楼-- · 2019-01-31 05:04

If you want to write it by hand, recursive decent is the most sensible way to go.

You could use a table parser, but that will be extremely hard to maintain.

Example:

Data = Object | Value;
Value = Ident, '=', Literal;
Object = '{', DataList, '}';
DataList = Data | DataList, Data;


ParseData {
  if PeekToken = '{' then 
    return ParseObject;
  if PeekToken = Ident then
    return ParseValue;
  return Error;
}

ParseValue {
  ident = TokenValue;
  if NextToken <> '=' then 
    return Error;
  if NextToken <> Literal then
    return Error;
  return(ident, TokenValue);
 }

ParseObject {
  AssertToken('{');
  temp = ParseDataList;
  AssertToken('}');
  return temp;
}

ParseDataList {
  data = ParseData;
  temp = []
  while Ok(data) {
    temp = temp + data;
    data = ParseData;
  }
}
查看更多
狗以群分
5楼-- · 2019-01-31 05:05

Adding my voice to the chorus in favor of recursive-descent (LL1). They are simple, fast, and IMO, not at all hard to maintain.

However, take a good look at your language to make sure it is LL1. If you have any syntax like C has, like ((((type))foo)[ ]) where you might have to descend multiple layers of parentheses before you even find out if you are looking at a type, variable, or expression, then LL1 will be very difficult, and bottom-up wins.

查看更多
登录 后发表回答