Using boost spirit for a stack based language

2020-07-13 08:33发布

问题:

I need to parse a rather simple stack based language, e.g.

1 2 add
3 1 sub

and I'm facing two choices here:

  1. Write my own lexer for the tokens and then proceed parsing it
  2. Use boost spirit

I have never used boost spirit but from what I've read around (documentation and examples) I can't still make up my mind on whether it would be overkill to use boost spirit to lex and parse this simple language or if it would make sense to use it instead of rolling out my own lexer and parser (thing that I suppose shouldn't be too hard).

Would using boost spirit for a simple stack based language like the one above pay off (since I need to learn it first before I can use it)?

回答1:

In the category "exhaustive explorations", let me add some "on the fly interpreting" stack machines using Spirit Qi (v2.x) and X3

Note that an AST-ful approach (2 stage parse/execute) is shown in the second answer

In Spirit Qi

Here the semantic actions have to be "composed" using Phoenix actors:

Live On Coliru

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/repository/include/qi_distinct.hpp>
#include <iostream>
#include <deque>

namespace qi = boost::spirit::qi;
namespace px = boost::phoenix;
namespace qr = boost::spirit::repository::qi;

using Stack = std::deque<int>;

namespace actors {

    struct pop {
        Stack& s_;

        Stack::value_type operator()() const {
            Stack::value_type v = s_.back();
            s_.pop_back();
            return v;
        }
    };

    struct push {
        Stack& s_;
        template <typename V> void operator()(V const& v) const {
            s_.push_back(v);
        }
    };

    struct dump {
        Stack& s_;
        void operator()() const {
            std::copy(s_.begin(), s_.end(), std::ostream_iterator<Stack::value_type>(std::cout, " "));
            std::cout << "\n";
        }
    };
}

int main() {
    Stack stack_;

    boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!
    bool ok;

    {
        using namespace qi;
        px::function<actors::pop>  pop_  = actors::pop{ stack_ };
        px::function<actors::push> push_ = actors::push{ stack_ };
        px::function<actors::dump> dump_ = actors::dump{ stack_ };

        ok = phrase_parse(f, l, 
               *(
                   eps [ dump_() ] >> 
                   (lexeme [ qr::distinct(graph) [
                          lit("add") [ push_(  pop_() + pop_()) ]
                        | lit("sub") [ push_(- pop_() + pop_()) ] // bit hackish
                        | lit("mul") [ push_(pop_() * pop_()) ]
                        | lit("div") [ push_(pop_() / pop_()) ] // TODO fix order
                        | lit("pop") [ pop_() ]
                      ] ] 
                    | int_ [ push_(_1) ]
                  )
                ), space);
    }

    if (!ok)
        std::cout << "Parse failed\n";

    if (f != l)
        std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
}

Printing

1 
1 2 
3 
3 3 
3 3 1 
3 2 
6 

Notes:

  • it gets ugly (the learning curve is hefty; See also Boost Spirit: "Semantic actions are evil"?)
  • the comments speak volumes; actually fixing that order or operands in sub and div is not easy. It's gonna require some advanced Phoenix-fu (http://www.boost.org/doc/libs/1_59_0/libs/phoenix/doc/html/phoenix/modules/scope/let.html)

In Spirit X3

The idea is the same, but we can use proper functional composition using lambdas.

We even use a helper to dynamically generate the parser expression along with suitable binop:

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <iostream>
#include <deque>
#include <cassert>

int main() {
    std::deque<int> stack_;

    boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!
    bool ok;

    {
        using namespace boost::spirit::x3;
        struct stack_tag {};

        auto binop = [](auto id, auto f) {
            auto apply = [=](auto& ctx) {
                auto& s = get<stack_tag>(ctx);
                assert(s.size()>=2);

                auto rhs = s.back(); s.pop_back();
                auto lhs = s.back(); s.pop_back();
                s.push_back(f(lhs, rhs));
            };

            return lexeme[as_parser(id) >> !graph] [apply];
        };

        auto push = [](auto& ctx) {
            auto& s = get<stack_tag>(ctx);
            s.push_back(_attr(ctx));
        };

        auto dump = [](auto& ctx) {
            auto& s = get<stack_tag>(ctx);
            std::copy(s.begin(), s.end(), std::ostream_iterator<int>(std::cout, " "));
            std::cout << "\n";
        };

        auto instr   = binop("add", [](auto a, auto b) { return a + b; })
                     | binop("sub", [](auto a, auto b) { return a - b; })
                     | binop("mul", [](auto a, auto b) { return a * b; })
                     | binop("div", [](auto a, auto b) { return a / b; })
                     | int_ [ push ]
                     ;

        auto parser  = skip(space) [ *(eps [ dump ] >> instr) >> eps/*post-skip*/ ];
        auto machine = with<stack_tag>(stack_) [parser];

        ok = parse(f, l, machine);
    }

    if (!ok)
        std::cout << "Parse failed\n";

    if (f != l)
        std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";
}

Of course it prints the same output.

  • it has none of the disadvantages the Qi version has
  • it compiles much faster (2.9s versus 9.2s!)
  • Note: X3 requires C++14


回答2:

We've seen an pure standard library approach.

This executed the instructions immediately.

Let's create a parser that build an AST (Abstract Syntax Tree). In the case of our simple stack machine it's just a list of instructions. Let's call it a Tape.

Using Boost Spirit

I'd still advise against a lexer. Lexers are supported in Spirit v2 (not in X3 - yet?). But in practice they complicate matters, and Spirit knows how to backtrack the input in case of mismatch. So you can just tentatively match productions and try the next if it wasn't the right "token".

Here's what using a Spirit grammar should look like:

Tape program;
boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

if (parse(f, l, Parser::program, program)) {
    std::cout << "Parsed " << program.size() << " instructions\n";
} else {
    std::cout << "Parse failed\n";
}

Now, the AST types are:

struct Add {};
struct Sub {};
struct Mul {};
struct Div {};
struct Pop {};

using Value = int;
using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;
using Tape  = std::vector<Instr>;

Simple, right.

The grammar

In X3, making the grammar is pretty lightweight. Top-down:

auto instr   = opcode_ | int_;
auto program = skip(space) [*instr];

Now, all we have to do is teach it to recognize opcodes. A start would be:

struct opcodes : symbols<Instr> {
    opcodes() {
        this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{});
    }
} opcode_;

The seasoned Spirit guru will spot a problem here: opcode_ is not a lexeme, and neither does it guarantee "distinct identifier" parsing. E.g. "a dd" would match Add. And "additional" would also match.

Luckily, X3 makes it really easy to compose directives on the fly:

auto opcode_ = [] {
    struct opcodes : symbols<Instr> {
        opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
    } codes_;

    return lexeme[codes_ >> !graph];
}();

So, now both holes are fixed.

Full Demo

Live On Coliru

#include <iostream>
#include <deque>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>

struct Add {};
struct Sub {};
struct Mul {};
struct Div {};
struct Pop {};

using Value = int;
using Instr = boost::variant<Add, Sub, Mul, Div, Pop, Value>;

struct Machine {
    using result_type = void;
    std::deque<Value> stack_;

    void operator()(Instr instr) {
        boost::apply_visitor(*this, instr);
    }

    void operator()(Add) {
        assert(stack_.size()>=2);
        auto op2 = stack_.back(); stack_.pop_back();
        auto op1 = stack_.back(); stack_.pop_back();
        stack_.push_back(op1 + op2);
    }

    void operator()(Sub) {
        assert(stack_.size()>=2);
        auto op2 = stack_.back(); stack_.pop_back();
        auto op1 = stack_.back(); stack_.pop_back();
        stack_.push_back(op1 - op2);
    }

    void operator()(Mul) {
        assert(stack_.size()>=2);
        auto op2 = stack_.back(); stack_.pop_back();
        auto op1 = stack_.back(); stack_.pop_back();
        stack_.push_back(op1 * op2);
    }

    void operator()(Div) {
        assert(stack_.size()>=2);
        auto op2 = stack_.back(); stack_.pop_back();
        auto op1 = stack_.back(); stack_.pop_back();
        assert(op2 != 0);
        stack_.push_back(op1 / op2);
    }

    void operator()(Value v) {
        stack_.push_back(v);
    }

    void operator()(Pop) {
        assert(stack_.size()>=1);
        stack_.pop_back();
    }

    void trace() const {
        using namespace std;
        // debug trace
        copy(stack_.begin(), stack_.end(), ostream_iterator<Value>(cout, " "));
        cout << "\n";
    }
};

using Tape = std::vector<Instr>;

namespace Parser {
    using namespace boost::spirit::x3;

    auto opcode_ = [] {
        struct opcodes : symbols<Instr> {
            opcodes() { this->add("add", Add{})("sub", Sub{})("mul", Mul{})("div", Div{})("pop", Pop{}); }
        } codes_;

        return lexeme[codes_ >> !graph];
    }();

    auto instr   = opcode_ | int_; // TODO
    auto program = skip(space) [*instr];
}

int main() {
    Tape program;
    boost::spirit::istream_iterator f(std::cin >> std::noskipws), l; // Note the noskipws!

    if (parse(f, l, Parser::program, program)) {
        std::cout << "Parsed " << program.size() << " instructions\n";
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Unparsed program data: '" << std::string(f,l) << "'\n";

    Machine machine;
    for (auto instr : program)
    {
        machine(instr);
        machine.trace();
    }
}

Prints:

Parsed 7 instructions
1 
1 2 
3 
3 3 
3 3 1 
3 2 
6 

Summary

The main gain here is:

  • we define our grammar declaratively. For a more involved grammar, this could be a huge advantage
  • we get backtracking for free - so no need to separate tokens ahead of time

    Note: This is a PEG grammar. Backtracking only amounts to trying the next sibling alternative or failing the current rule (so the parent rule could try the next sibling alternative).

    This is significantly different from backtracking as in Regular Expression. You'll note the difference with Kleene-* en other repetitive parser expressions. In PEG grammars, these are always greedy, and never backtrack only a single element (similar to "Maximum Munch" rules).

  • We don't have any messy switch anymore. In fact it's hidden inside the Variant Visitor (see apply_visitor).

  • The instruction execution is virtually unmodified, but we renamed execute to operator() so it models the Visitor concept.



回答3:

First approach can be devillishly simple in plain c++:

int main() {
    Machine<int> machine;

    std::for_each(
            std::istream_iterator<std::string> { std::cin },
            {},
            [&](auto& instr) { machine.process(instr); }
        );
}

This exploits the fact that reading white-space separated strings is good enough as a "lexer" (tokenizer).

Now, implement process in the simplest way possible:

    static const char* opcodes[] = { "add", "sub", "mul", "div", "pop" };
    auto op = find(begin(opcodes), end(opcodes), instr);

    enum { add, sub, mul, div, pop, other };

    switch(op - opcodes) {
        case add: execute(Add{}); break;
        case sub: execute(Sub{}); break;
        case mul: execute(Mul{}); break;
        case div: execute(Div{}); break;
        case pop: execute(Pop{}); break;
        case other: {
            istringstream iss(instr);
            value_type v;
            if (iss >> v)
                execute(v);
            else
                throw runtime_error("Invalid instruction '" + instr + "'");
        }
    }

Adding some debug tracing, we get the following output for the program "1 2 add 3 1 sub mul":

Executing 1: 1 
Executing 2: 1 2 
Executing add: 3 
Executing 3: 3 3 
Executing 1: 3 3 1 
Executing sub: 3 2 
Executing mul: 6 

Live On Coliru

Using Boost Spirit

I've added it as a separate answer