YAML parsing - lex or hand-rolled?

2020-06-04 05:20发布

问题:

I am trying to write a simple YAML parser, I read the spec from yaml.org, before I start, I was wondering if it is better to write a hand-rolled parser, or use lex (flex/bison). I looked at the libyaml (C library) - doesn't seem to use lex/yacc. YAML (excluding the flow styles), seems to be more line-oriented, so, is it easier to write a hand-rolled parser, or use flex/bison Thanks.

回答1:

This answer is basically an answer to the question: "Should I roll my own parser or use parser generator?" and has not much to do with YAML. But nevertheless it will "answer" your question.

The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.

I have written among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples, because they go against intuition and exterior requirements dictated the decision.

Since I established on a technical level it is possible, why would you pick one over the other? This is really hard question to answer, here are some thoughts on the subject:

  • Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use of hand-rolling your own lexer, unless you have really exotic input formats.
  • Using bison or similar generators make the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
  • Using a recursive descent parser makes is quite clear what happens in the parser. This provides the easy means to gracefully handle harry conflicts. You can write a simple if, instead of rearranging the entire grammar to be LALR1.
  • While developing the parser you can "gloss over details" with a hand written parser, using bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
  • Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.

This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should have written both types of parser.