Is it a good idea? For a reason I thought it should be faster than boost's tokenizer or split. however most of the time I'm stuck in the boost::spirit::compile
template <typename Iterator>
struct ValueList : bsq::grammar<Iterator, std::vector<std::string>()>
{
ValueList(const std::string& sep, bool isCaseSensitive) : ValueList::base_type(query)
{
if(isCaseSensitive)
{
query = value >> *(sep >> value);
value = *(bsq::char_ - sep);
}
else
{
auto separator = bsq::no_case[sep];
query = value >> *(separator >> value);
value = *(bsq::char_ - separator);
}
}
bsq::rule<Iterator, std::vector<std::string>()> query;
bsq::rule<Iterator, std::string()> value;
};
inline bool Split(std::vector<std::string>& result, const std::string& buffer, const std::string& separator,
bool isCaseSensitive)
{
result.clear();
ValueList<std::string::const_iterator> parser(separator, isCaseSensitive);
auto itBeg = buffer.begin();
auto itEnd = buffer.end();
if(!(bsq::parse(itBeg, itEnd, parser, result) && (itBeg == itEnd)))
result.push_back(buffer);
return true;
}
I've implemented it as shown above. What is wrong with my code? or just because the separator is defined in runtime the recompilation is inevitable?
EDIT001:
Example and comparison with possible implementation with boost::split and original imp with tokenizer on CoLiRu
Looks like coliru is down now. In any case these are result for 1M runs on string "2lkhj309|ioperwkl|20sdf39i|rjjdsf|klsdjf230o|kx23904iep2|xp39f4p2|xlmq2i3219" with separator "|"
8000000 splits in 1081ms.
8000000 splits in 1169ms.
8000000
splits in 2663ms.
first is for tokenizer, second is for boost::split and the third is for boost::spirit
First off, the different versions do not do the same thing:
- in particular token compression behaves differently (I fixed it for
boost::split
but it doesn't appear to be a feature for boost::tokenizer
)
- separators were treated as string literals rather than charsets in the Spirit version (fixed that)
Yes, recompiles are inevitable with dynamic separators. But no, this is not the bottleneck (the other approaches have dynamic separators too):
I've done some optimizations. The timings:
Coliru clang++:
8000000 original (boost::tokenizer) rate: 2.84257μs
10000000 possible (boost::split) rate: 3.09941μs
10000000 spirit (dynamic version) rate: 1.45456μs
10000000 spirit (direct version, avoid type erasure) rate: 1.25588μs
next step:
10000000 spirit (precompiled sep) rate: 1.18059μs
Coliru g++
8000000 original (boost::tokenizer) rate: 2.92805μs
10000000 possible (boost::split) rate: 2.75442μs
10000000 spirit (dynamic version) rate: 1.32821μs
10000000 spirit (direct version, avoid type erasure) rate: 1.10712μs
next step:
10000000 spirit (precompiled sep) rate: 1.0791μs
Local system g++:
sehe@desktop:/tmp$ time ./test
8000000 original (boost::tokenizer) rate: 1.80061μs
10000000 possible (boost::split) rate: 1.29754μs
10000000 spirit (dynamic version) rate: 0.607789μs
10000000 spirit (direct version, avoid type erasure) rate: 0.488087μs
next step:
10000000 spirit (precompiled sep) rate: 0.498769μs
As you can see the Spirit approach doesn't need to be slower. What steps did I take? http://paste.ubuntu.com/11001344/
Now it seems fairly obvious that all the implementations would benefit from not "compiling" the separator each time. I didn't do it for all the approaches, but for fun let's do it for the Spirit version:
- Using hardcoded '|' delimiter 0.455269μs // fixed delimiter
Full listing:
Live On Coliru
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>
#include <vector>
#include <string>
#include <chrono>
#include <iostream>
void original(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
result.clear();
boost::char_separator<char> sep(delimiter.c_str());
boost::tokenizer<boost::char_separator<char>, std::string::const_iterator, std::string> tok(input, sep);
for (const auto& token : tok)
{
result.push_back(token);
}
}
void possible(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
result.clear();
boost::split(result, input, boost::is_any_of(delimiter), boost::algorithm::token_compress_off);
}
namespace bsq = boost::spirit::qi;
void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
result.clear();
using namespace bsq;
if (!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
result.push_back(input);
}
namespace detail {
template <typename Sep> bsq::rule<std::string::const_iterator, std::vector<std::string>()>
make_spirit_parser(Sep const& sep)
{
using namespace bsq;
return raw[*(char_ - sep)] % sep;
}
static const auto precompiled_pipes = make_spirit_parser('|');
}
void spirit(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
result.clear();
if (!bsq::parse(input.begin(), input.end(), detail::make_spirit_parser(bsq::char_(delimiter)), result))
result.push_back(input);
}
void spirit_pipes(std::vector<std::string>& result, const std::string& input)
{
result.clear();
if (!bsq::parse(input.begin(), input.end(), detail::precompiled_pipes, result))
result.push_back(input);
}
template <typename F> void bench(std::string const& caption, F approach) {
size_t const iterations = 1000000;
using namespace std::chrono;
using C = high_resolution_clock;
auto start = C::now();
size_t count = 0;
for (auto i = 0U; i < iterations; ++i) {
count += approach();
}
auto us = duration_cast<std::chrono::microseconds>(C::now() - start).count();
std::cout << count << " " << caption << " rate: " << (1.*us/iterations) << "μs\n";
}
int main()
{
std::string const input = "2309|ioperwkl|2039i|rjjdsf|klsdjf230o|kx23904iep2|xp,39,4p2|xlmq2i3219||";
auto separator = "|";
std::vector<std::string> result;
bench("original (boost::tokenizer)", [&] {
original(result, input, separator);
return result.size();
});
bench("possible (boost::split)", [&] {
possible(result, input, separator);
return result.size();
});
bench("spirit (dynamic version)", [&] {
spirit(result, input, separator);
return result.size();
});
bench("spirit (direct version, avoid type erasure)", [&] {
spirit_direct(result, input, separator);
return result.size();
});
std::cout << "\nnext step:\n";
bench("spirit (precompiled sep)", [&] {
spirit_pipes(result, input);
return result.size();
});
}