I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are shifted onto the parse stack, and when a rule's right-hand side is matched the tokens are reduced, with the parsing table dictating whether to shift or reduce. This continues until all of the tokens have been reduced and parsing is complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
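For what it's worth, the closest I've gotten is imagining the rules stored as plain data, something like this (the symbol names are just placeholders I made up):

```python
# Grammar rules stored as plain data: each production maps a
# nonterminal to one possible right-hand side (a tuple of symbols).
# All symbol names here are hypothetical placeholders.
GRAMMAR = [
    ("config",  ("config", "section")),                      # config -> config section
    ("config",  ("section",)),                               # config -> section
    ("section", ("LBRACKET", "NAME", "RBRACKET", "body")),   # section -> [ NAME ] body
    ("body",    ("NAME", "EQUALS", "VALUE")),                # body -> NAME = VALUE
]

# Nonterminals are the symbols that appear on a left-hand side;
# everything else on a right-hand side is a terminal (a token).
nonterminals = {lhs for lhs, _ in GRAMMAR}
terminals = {sym for _, rhs in GRAMMAR for sym in rhs} - nonterminals
```

Is that roughly the right idea?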
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.
The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as GNU calls them, flex/bison.
Edit:
Perl has Parse::RecDescent, which is a recursive descent parser rather than LR, but it may be a better fit for simple jobs.
You should also read about ANTLR.
Well, you can't understand it like
"Function A1 does f to object B, then function A2 does g to D, etc."
It's more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from table-like data, such as a set of BNF grammar rules) telling the function what to do.
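To make that concrete, here is a tiny, hand-made fragment of what such a table can look like as data (this is an illustration, not a real generated table, and the token names are made up):

```python
# One cell per (state, lookahead token). Each cell holds the
# instruction the generic parse function should carry out:
#   ("shift", s)  - push the token and enter state s
#   ("reduce", r) - pop the rule's right-hand side and apply rule r
ACTION = {
    (0, "NAME"):   ("shift", 3),
    (3, "EQUALS"): ("shift", 5),
    (5, "VALUE"):  ("shift", 7),
    (7, "$"):      ("reduce", "pair -> NAME EQUALS VALUE"),
}
```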
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers are complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want line number information for individual tokens within a block of text, not just for the block as a whole. If line numbers for each word are unnecessary, and entire blocks will fit into memory, I'd be inclined to model each bracketed block as a single token, as this may speed up processing. Either way you'll need a custom yylex function. Start by generating one with lex using fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about which markers to look for from the yacc code.

I looked at the definition of your file format. While I am missing some of the context for why you specifically want an LR parser, my first thought was: why not use an existing format like XML or JSON? Going down the parser-generator route usually has a high startup cost that will not pay off for the simple data you are looking to parse.
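As a rough illustration of the custom-lexer idea (fixed "[" and "]" markers, with a line number tracked per token), here is a Python stand-in for such a yylex; the token names and exact rules are assumptions made for the sketch:

```python
# Hand-rolled tokenizer sketch: yields (type, text, line) triples.
# "[" and "]" are fixed block markers; any run of other characters
# becomes a single TEXT token tagged with its starting line.
def tokenize(source):
    line, i = 1, 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch.isspace():
            i += 1
        elif ch in "[]":
            yield ("LBRACKET" if ch == "[" else "RBRACKET", ch, line)
            i += 1
        else:
            start, start_line = i, line
            while i < len(source) and source[i] not in "[]":
                line += source[i] == "\n"   # count newlines inside the run
                i += 1
            yield ("TEXT", source[start:i].strip(), start_line)
```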
As Paul said, lex/yacc are an option; you might also want to have a look at Boost::Spirit.
I have worked with neither. A year ago I wrote a much larger parser using QLALR by the Qt/Nokia people. When I researched parsers, this one, even though very under-documented, had the smallest footprint to get started (only one tool), but it does not support lexical analysis. IIRC I could not figure out C++ support in ANTLR at that time.
10,000-mile view: in general you are looking at two components: a lexer that takes the input characters and turns them into higher-order tokens, and a parser that works off those tokens. Your grammar description will state rules, and usually you will include some code with each rule; this code is executed when the rule is matched. The compiler generator (e.g. yacc) will take your description of the rules and the code and turn it into compilable code. Unless you are doing this by hand, you will not be manipulating the tables yourself.
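As a sketch of "code included with the rules" (plain Python standing in for what a generator like yacc would emit; the mini-grammar is hypothetical):

```python
# Each grammar rule is paired with a semantic action, the way a
# yacc production is paired with a code block. When the parser
# reduces by a rule, it calls that rule's action with the values
# of the matched symbols.
def make_pair(name, _equals, value):
    return (name, value)

RULES = {
    "pair": (("NAME", "EQUALS", "VALUE"), make_pair),  # pair -> NAME = VALUE
}

# Simulating a single reduction by hand: the parser pops the matched
# symbols' values off its stack and passes them to the action.
rhs, action = RULES["pair"]
result = action("color", "=", "blue")   # -> ("color", "blue")
```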
In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand-writing a table-driven, bottom-up parser like the one you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the tables, and that code is almost completely generic (i.e. it works for any grammar). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
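To make "small generic code plus tables" concrete, here is a toy shift/reduce driver in Python. The loop is the generic part; everything grammar-specific lives in the tables, which were worked out by hand here for a tiny expression grammar purely for illustration:

```python
# Grammar:  0: S -> E    1: E -> E '+' n    2: E -> n
RULES = [("S", 1), ("E", 3), ("E", 1)]   # (lhs, rhs length) per rule
ACTION = {
    (0, "n"): ("shift", 2), (1, "+"): ("shift", 3), (1, "$"): ("accept",),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}   # state to enter after reducing to a nonterminal

def parse(tokens):
    stack = [0]                       # the state stack
    tokens = list(tokens) + ["$"]     # end-of-input marker
    pos = 0
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            raise SyntaxError("unexpected %r" % tokens[pos])
        if act[0] == "shift":
            stack.append(act[1])
            pos += 1
        elif act[0] == "reduce":
            lhs, length = RULES[act[1]]
            del stack[len(stack) - length:]    # pop the right-hand side
            stack.append(GOTO[(stack[-1], lhs)])
        else:                          # accept
            return True
```

For example, `parse(["n", "+", "n"])` succeeds, while an input the tables don't allow raises a SyntaxError.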
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table: normally, one row for each parser state and one column for each possible input symbol. In each cell, you encode the next state to enter when you are in that state and receive that input. Most of these transitions are basically empty (i.e. they simply say that the input isn't allowed in that state).
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to the same, though; I'm pretty sure that, under UN guidelines, it would count as torture.
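For the curious, the collection step described above (subset construction) is mechanical enough to fit in a few lines; here is a generic sketch, with the NFA given as a plain transition dictionary:

```python
from collections import deque

def epsilon_closure(states, transitions):
    """All NFA states reachable from `states` via epsilon (None) moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, transitions, alphabet):
    """Collect sets of NFA states into DFA states, cell by cell."""
    dfa_start = epsilon_closure({start}, transitions)
    dfa_trans, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:
        current = work.popleft()
        for sym in alphabet:
            move = set()
            for s in current:
                move |= transitions.get((s, sym), set())
            if not move:
                continue                       # empty cell: input not allowed
            nxt = epsilon_closure(move, transitions)
            dfa_trans[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return dfa_start, dfa_trans
```

The book-keeping warned about above is exactly the `seen`/`work` pair: every new set of NFA states must be queued once and only once.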