Developing a simple parser

2020-02-09 17:33发布

My day job includes working to develop a Pascal-like compiler. I've been working all along on optimizations and code generation.

I would also like to start learning to build a simple parser for the same language. I'm however, not really sure how to go about this. Flex and Bison seem to be the choice. But, isn't it possible to write a parser using C++ or C#? I'm a bit creepy with C.

Yacc++ supports C#, but it's a licensed one. I'm looking for all the help that I can find in this regard. Suggestions would be highly appreciated.

9条回答
混吃等死
2楼-- · 2020-02-09 18:01

Personally, I roll my own lexer and parser (LL). Here's a very-abbreviated example. It is in C++, but hopefully you can adapt it. It makes use of a macro PARSE_HIGHER to make it easy to insert operators at different precedence levels without much code changing.

 // routine to scan over whitespace/comments
void ScanWhite(const char* &pc){
  while(true){
    if(0);
    else if (WHITESPACE(*pc)) pc++;
    else if (pc[0]=='/' && pc[1]=='/'){
      while(*pc && *pc++ != '\n');
    }
    else break;
  }
}
// routine to lex an identifier
bool SeeId(const char* &pc, string sId){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (alpha(*pc)){
    sId = "";
    while(alphanum(*pc)) sId += (*pc++);
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a number
bool SeeNum(const char* &pc, double &dNum){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (digit(*pc)){
    dNum = 0;
    while(digit(*pc)) dNum = dNum * 10 + (*pc++ - '0');
    if (*pc == '.'){
      double divisor = 1, frac = 0;
      while(digit(*pc)){
        divisor *= 0.1;
        frac += (*pc++ - '0') * divisor;
      }
      dNum += frac;
    }
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex some constant word
bool SeeWord(const char* &pc, const char* sWord){
  ScanWhite(pc);
  const char* pc0 = pc;
  int len = strlen(sWord);
  if (strncmp(pc, sWord, len)==0 && !alphanum(pc[len])){
    pc += len;
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a single character like an operator
bool SeeChar(const char* &pc, const char c){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (*pc == c){
    pc++;
    return true;
  }
  pc = pc0;
  return false;
}
// primitive expression parser
void ParsePrimitiveExpr(const char* &pc, CNode* &p){
  double dNum;
  char sId[100];
  if (0);
  else if (SeeNum(pc, dNum)){
    p = new CNode(dNum);
  }
  else if (SeeId(pc, sId)){
    // see if its a function call
    if (SeeChar(pc, '(')){
      p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    // otherwise its just a variable reference
    else {
      p = new CNode(sId);
    }
  }
  // handle embedded expressions
  else if (SeeChar(pc, '(')){
    ParseExpression(pc, p);
    if (!SeeChar(pc, ')')) /* deal with syntax error */
  }
}
#define PARSE_HIGHER ParsePrimitiveExpr
// product parser
void ParseProduct(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '*')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('*', p, p1);
    }
    else if (SeeChar(pc, '/')){
     CNode p1 = null;
     PARSE_HIGHER(pc, p1);
     p = new CNode('/', p, p1);
   }
   else break;
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseProduct
// sum parser
void ParseSum(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '+')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('+', p, p1);
    }
    else if (SeeChar(pc, '-')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('-', p, p1);
    }
   else break;
  }
}
#undef  PARSE_HIGHER
// can insert more routines like the above
// to handle move operators
#define PARSE_HIGHER ParseSum
// overall expression parser
void ParseExpression(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
}

Added some Pascal-style statement syntax:

void ParseStatement(const char* &pc){
  char sId[100];
  if(0);
  else if (SeeWord(pc, "begin")){
    while(!SeeWord(pc, "end")){
      ParseStatement(pc);
      SeeChar(pc, ';');
    }
  }
  else if (SeeWord(pc, "while")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    /* semantics for while statement */
  }
  else if (SeeWord(pc, "if")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    if (SeeWord(pc, "else")){
      ParseStatement(pc);
    }
    /* semantics for if statement */
  }
  else if (SeeWord(pc, "for")){
    /* you do it */
  }
  // handle assignments and subroutine calls
  else if (SeeId(pc, sId)){
    if(0);
    else if (SeeChar(pc, '=')){
      CNode* p1 = null;
      ParseExpression(pc, p1);
      /* semantics for assignment statement */
    }
    else if (SeeChar(pc, '(')){
      CNode* p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    else {
      /* we have a 1-word statement, which is OK in pascal */
    }
  }
  else {
    /* syntax error */
  }
}

It still needs syntax for array indexing, variable declaration, and function definition, but I hope it is clear how to do that.

查看更多
\"骚年 ilove
3楼-- · 2020-02-09 18:02

If you were writing it in Java I'd recommend ANTLR. It's a nice LL(*) parser-generator written in Java. There's a terrific book for it on Amazon, too.

查看更多
Animai°情兽
4楼-- · 2020-02-09 18:03

You can actually use flex & bison with C++. In this tutorial, for example, you can see that section 5 is dedicated to that matter. Just google for it, and I'm sure you will find lots of examples.

查看更多
▲ chillily
5楼-- · 2020-02-09 18:06

When you use Lex and Yacc you don't actually write much of anything in C. Lex is its own language, as is Yacc. So you write the lexical analyzer in Lex and the parser in Yacc. However, for Pascal, Lex and Yacc inputs are already available.

The resulting parser and lexer have C interfaces, that is true. However most languages, including C++, have simple ways to call (or wrap) C interfaces.

I'm not an expert in it, but I'm sure all of the above goes for ANTLR as well.

If you are asking to do it in "pure C++" (whatever that means), look into using boost spirit. I don't really see the point in theoretical purity if it causes a ton more work though.

Writing your own lexers and parsers by hand is actually kinda fun. A lexer is one of the very few situations where you can justify using both gotos and the preprocessor. However, I wouldn't suggest it for a full-blown language like Pascal if you can avoid it. That would be a lot of work. I'm talking man-years.

查看更多
SAY GOODBYE
6楼-- · 2020-02-09 18:09

I believe you can use ANTLR with C#. I've never tried it myself (yet), however there is a tutorial here that might point you in the right direction.

查看更多
地球回转人心会变
7楼-- · 2020-02-09 18:13

bison & flex are the canonical parser generators. If you're interested in C++, I've found boost spirit useful. I've never used it for anything as complex as a compiler though. I'm sure others will have interesting suggestions for other languages such as C#...

查看更多
登录 后发表回答