Splitting string using boost spirit

2019-04-16 10:57发布

Is it a good idea? For a reason I thought it should be faster than boost's tokenizer or split. however most of the time I'm stuck in the boost::spirit::compile

template <typename Iterator>
struct ValueList : bsq::grammar<Iterator, std::vector<std::string>()>
{
    ValueList(const std::string& sep, bool isCaseSensitive) : ValueList::base_type(query)
    {
        if(isCaseSensitive)
        {
            query = value >> *(sep >> value);
            value = *(bsq::char_ - sep);
        }
        else
        {
            auto separator = bsq::no_case[sep];
            query = value >> *(separator >> value);
            value = *(bsq::char_ - separator);
        }
    }
    bsq::rule<Iterator, std::vector<std::string>()> query;
    bsq::rule<Iterator, std::string()> value;
};

inline bool Split(std::vector<std::string>& result, const std::string& buffer, const std::string& separator,
                  bool isCaseSensitive)
{
    result.clear();
    ValueList<std::string::const_iterator> parser(separator, isCaseSensitive);
    auto itBeg = buffer.begin();
    auto itEnd = buffer.end();
    if(!(bsq::parse(itBeg, itEnd, parser, result) && (itBeg == itEnd)))
        result.push_back(buffer);
    return true;
}

I've implemented it as shown above. What is wrong with my code? or just because the separator is defined in runtime the recompilation is inevitable?

EDIT001: Example and comparison with possible implementation with boost::split and original imp with tokenizer on CoLiRu Looks like coliru is down now. In any case these are result for 1M runs on string "2lkhj309|ioperwkl|20sdf39i|rjjdsf|klsdjf230o|kx23904iep2|xp39f4p2|xlmq2i3219" with separator "|"

8000000 splits in 1081ms.
8000000 splits in 1169ms.
8000000 splits in 2663ms.

first is for tokenizer, second is for boost::split and the third is for boost::spirit

1条回答
一夜七次
2楼-- · 2019-04-16 11:35

First off, the different versions do not do the same thing:

  • in particular token compression behaves differently (I fixed it for boost::split but it doesn't appear to be a feature for boost::tokenizer)
  • separators were treated as string literals rather than charsets in the Spirit version (fixed that)

Yes, recompiles are inevitable with dynamic separators. But no, this is not the bottleneck (the other approaches have dynamic separators too):


I've done some optimizations. The timings:

  • Coliru clang++:

    8000000 original (boost::tokenizer) rate: 2.84257μs
    10000000 possible (boost::split) rate: 3.09941μs
    10000000 spirit (dynamic version) rate: 1.45456μs
    10000000 spirit (direct version, avoid type erasure) rate: 1.25588μs
    
    next step:
    10000000 spirit (precompiled sep) rate: 1.18059μs
    
  • Coliru g++

    8000000 original (boost::tokenizer) rate: 2.92805μs
    10000000 possible (boost::split) rate: 2.75442μs
    10000000 spirit (dynamic version) rate: 1.32821μs
    10000000 spirit (direct version, avoid type erasure) rate: 1.10712μs
    
    next step:
    10000000 spirit (precompiled sep) rate: 1.0791μs
    
  • Local system g++:

    sehe@desktop:/tmp$ time ./test
    8000000 original (boost::tokenizer) rate: 1.80061μs
    10000000 possible (boost::split) rate: 1.29754μs
    10000000 spirit (dynamic version) rate: 0.607789μs
    10000000 spirit (direct version, avoid type erasure) rate: 0.488087μs
    
    next step:
    10000000 spirit (precompiled sep) rate: 0.498769μs
    

As you can see the Spirit approach doesn't need to be slower. What steps did I take? http://paste.ubuntu.com/11001344/

  • refactor benchmark to display rate (no change) 3.523μs
  • made case_insensitivity choice of caller (just use no_case[char_(delimiter)] if required) 2.742μs.
  • eliminate value subrule (reduced copying and dynamic dispatch because of type-erased non-terminal rule) 2.579μs.
  • Made delimiter charset instead of string literal: 2.693μs.

    see intermediate version on coliru. (I recommend the code below, it's much more cleaned up)

  • Using qi::raw[] instead of std::string synthesized attributes (avoid copying!) 0.624072μs

  • Eliminating all non-terminals (i.e. type-erasure; see spirit_direct implementation) rate: 0.491011μs

Now it seems fairly obvious that all the implementations would benefit from not "compiling" the separator each time. I didn't do it for all the approaches, but for fun let's do it for the Spirit version:

  • Using hardcoded '|' delimiter 0.455269μs // fixed delimiter

Full listing:

Live On Coliru

#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>

#include <vector>
#include <string>
#include <chrono>
#include <iostream>

void original(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
    result.clear();
    boost::char_separator<char> sep(delimiter.c_str());
    boost::tokenizer<boost::char_separator<char>, std::string::const_iterator, std::string> tok(input, sep);

    for (const auto& token : tok)
    {
        result.push_back(token);
    }
}

void possible(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
    result.clear();
    boost::split(result, input, boost::is_any_of(delimiter), boost::algorithm::token_compress_off);
}

namespace bsq = boost::spirit::qi;

void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    using namespace bsq;
    if (!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
        result.push_back(input);
}

namespace detail {
    template <typename Sep> bsq::rule<std::string::const_iterator, std::vector<std::string>()>
        make_spirit_parser(Sep const& sep) 
        {
            using namespace bsq;
            return raw[*(char_ - sep)] % sep;
        }

    static const auto precompiled_pipes = make_spirit_parser('|');
}

void spirit(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    if (!bsq::parse(input.begin(), input.end(), detail::make_spirit_parser(bsq::char_(delimiter)), result))
        result.push_back(input);
}

void spirit_pipes(std::vector<std::string>& result, const std::string& input)
{
    result.clear();
    if (!bsq::parse(input.begin(), input.end(), detail::precompiled_pipes, result))
        result.push_back(input);
}

template <typename F> void bench(std::string const& caption, F approach) {
    size_t const iterations = 1000000;
    using namespace std::chrono;
    using C = high_resolution_clock;

    auto start = C::now();

    size_t count = 0;
    for (auto i = 0U; i < iterations; ++i) {
        count += approach();
    }

    auto us = duration_cast<std::chrono::microseconds>(C::now() - start).count();
    std::cout << count << " " << caption << " rate: " << (1.*us/iterations) << "μs\n";
}

int main()
{
    std::string const input = "2309|ioperwkl|2039i|rjjdsf|klsdjf230o|kx23904iep2|xp,39,4p2|xlmq2i3219||";
    auto separator = "|";

    std::vector<std::string> result;

    bench("original (boost::tokenizer)", [&] {
            original(result, input, separator);
            return result.size();
        });
    bench("possible (boost::split)", [&] {
            possible(result, input, separator);
            return result.size();
        });
    bench("spirit (dynamic version)", [&] {
            spirit(result, input, separator);
            return result.size();
        });
    bench("spirit (direct version, avoid type erasure)", [&] {
            spirit_direct(result, input, separator);
            return result.size();
        });

    std::cout << "\nnext step:\n";
    bench("spirit (precompiled sep)", [&] {
            spirit_pipes(result, input);
            return result.size();
        });
}
查看更多
登录 后发表回答