Parse BibTeX with flex+bison: revisited

For the last few weeks, I have been trying to write a parser for BibTeX files (http://www.bibtex.org/Format/) using flex and bison.

$ cat raw.l
%{
#include "raw.tab.h" 
%}
value [\"\{][a-zA-Z0-9 .\t\{\} \"\\]*[\"\}]
%%
[a-zA-Z]*               return(KEY);
\"                          return(QUOTE);
\{                          return(OBRACE);
\}                          return(EBRACE);
;                           return(SEMICOLON);
[ \t]+                  /* ignore whitespace */;
{value}     {
    yylval.sval = malloc(strlen(yytext));
    strncpy(yylval.sval, yytext, strlen(yytext));
    return(VALUE);
}

$ cat raw.y
%{
#include <stdio.h>
%}

//Symbols.
%union
{
 char *sval;
};
%token <sval> VALUE
%token KEY
%token OBRACE
%token EBRACE
%token QUOTE
%token SEMICOLON 

%start Entry
%%

Entry:
     '@'KEY OBRACE VALUE ',' 
     KeyVal
     EBRACE
     ;

KeyVal:
      /* empty */
      | KeyVal '=' VALUE ','
      | KeyVal '=' VALUE 
      ;
%%

int yyerror(char *s) {
  printf("yyerror : %s\n",s);
}

int main(void) {
  yyparse();

}

A sample BibTeX file is:

@Book{a1,
    author = "a {\"m}ook, Rudra Banerjee",
    Title="ASR",
    Publisher="oxf",
    Year="2010",
    Add="UK",
    Edition="1",
}
@Article{a2,
    Author="Rudra Banerjee",
    Title="Fe{\"Ni}Mo",
    Publisher={P{\"R}B},
    Issue="12",
    Page="36690",
    Year="2011",
    Add="UK",
    Edition="1",
}

When I try to parse it, it gives a syntax error. With GDB, it shows that it (probably) expects the fields in KEY to be declared:

Reading symbols from /home/rudra/Programs/lex/Parsing/a.out...done.
(gdb) Undefined command: "".  Try "help".
(gdb) Undefined command: "Author".  Try "help".
(gdb) Undefined command: "Editor".  Try "help".
(gdb) Undefined command: "Title".  Try "help".
.....

I would be grateful if someone could help me with this.

Tags: parsing bison flex-lexer bibtex

1 answer

看我几分像从前

Reply #2 · 2019-01-27 06:26

Lots of problems. First, your lexer is confused: it tries to recognize quoted strings and braced things as a single VALUE, and it also tries to recognize single characters like " and {. For quotes, it makes sense to have the lexer recognize the whole string, but for structural things that you want to parse (like braced lists), you need to return single tokens for the parser to parse. Second, when allocating space for a string, you aren't allocating space for a NUL terminator. Finally, your grammar looks odd: it wants to parse things like = VALUE = VALUE as a KeyVal, which doesn't correspond to anything in a BibTeX file.

So first, for the lexer. You want to recognize quoted strings and identifiers, but other things should be single characters:

[A-Za-z][A-Za-z0-9]*      { yylval.sval = strdup(yytext); return KEY; }
\"([^"\\]|\\.)*\"         { yylval.sval = strdup(yytext); return VALUE; }
[ \t\n]                   ; /* ignore whitespace */
[{}@=,]                   { return *yytext; }
.                         { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

Now you need a parser for the entries:

Input: /* empty */ | Input Entry ;  /* input is zero or more entries */
Entry: '@' KEY '{' KEY ',' KeyVals '}' ;
KeyVals: /* empty */ | KeyVals KeyVal ; /* zero or more keyvals */
KeyVal: KEY '=' VALUE ',' ;

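That should parse the quoted-string fields in the example you give. The braced value Publisher={P{\"R}B} in the second entry is not covered by these rules; it would need an additional KeyVal alternative built from the '{' and '}' tokens.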
