Is it a good idea? For some reason I thought it would be faster than Boost's tokenizer or split. However, most of the time is spent in boost::spirit::compile.
    #include <boost/spirit/include/qi.hpp>
    #include <string>
    #include <vector>

    namespace bsq = boost::spirit::qi;

    template <typename Iterator>
    struct ValueList : bsq::grammar<Iterator, std::vector<std::string>()>
    {
        ValueList(const std::string& sep, bool isCaseSensitive) : ValueList::base_type(query)
        {
            if(isCaseSensitive)
            {
                query = value >> *(sep >> value);
                value = *(bsq::char_ - sep);
            }
            else
            {
                // Spelled out twice on purpose: holding a Spirit expression in an
                // `auto` variable leaves dangling references to its temporaries.
                query = value >> *(bsq::no_case[sep] >> value);
                value = *(bsq::char_ - bsq::no_case[sep]);
            }
        }
        bsq::rule<Iterator, std::vector<std::string>()> query;
        bsq::rule<Iterator, std::string()> value;
    };
    inline bool Split(std::vector<std::string>& result, const std::string& buffer, const std::string& separator,
                      bool isCaseSensitive)
    {
        result.clear();
        // The grammar is constructed (and its expressions compiled) on every call.
        ValueList<std::string::const_iterator> parser(separator, isCaseSensitive);
        auto itBeg = buffer.begin();
        auto itEnd = buffer.end();
        // On a failed or partial parse, append the whole buffer as a single token.
        if(!(bsq::parse(itBeg, itEnd, parser, result) && (itBeg == itEnd)))
            result.push_back(buffer);
        return true;
    }
I've implemented it as shown above. What is wrong with my code? Or is the recompilation simply inevitable because the separator is only known at runtime?
EDIT001:
An example, with a comparison against a possible boost::split implementation and the original tokenizer implementation, is on Coliru.
It looks like Coliru is down right now. In any case, these are the results for 1M runs on the string "2lkhj309|ioperwkl|20sdf39i|rjjdsf|klsdjf230o|kx23904iep2|xp39f4p2|xlmq2i3219" with the separator "|":
8000000 splits in 1081ms.
8000000 splits in 1169ms.
8000000 splits in 2663ms.
The first is for the tokenizer, the second for boost::split, and the third for boost::spirit.
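For context, figures of the form "N splits in M ms" can be produced by a small timing loop. The following is a minimal sketch of such a harness, not the code that was on Coliru: `SplitNaive` and `Benchmark` are hypothetical names, and `SplitNaive` is only a standard-library stand-in for whichever splitter is being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the splitter under test: splits on one character
// and keeps empty tokens.
inline void SplitNaive(std::vector<std::string>& out, const std::string& buffer, char sep)
{
    out.clear();
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = buffer.find(sep, start);
        if (pos == std::string::npos) {
            out.push_back(buffer.substr(start));
            return;
        }
        out.push_back(buffer.substr(start, pos - start));
        start = pos + 1;
    }
}

// Calls split(result, input) `runs` times, stores the elapsed wall-clock time
// in milliseconds in elapsedMs, and returns the total number of tokens produced.
template <typename Splitter>
std::size_t Benchmark(Splitter split, const std::string& input, std::size_t runs, long long& elapsedMs)
{
    assert(runs > 0);
    std::vector<std::string> result;
    std::size_t total = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        split(result, input);
        total += result.size(); // also keeps the work from being optimized away
    }
    const auto t1 = std::chrono::steady_clock::now();
    elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    return total;
}
```

Passing a lambda that forwards to the splitter being measured, then printing the returned total and `elapsedMs`, yields a line in the same format as the results above. The test string contains 8 tokens, so 1M runs correspond to 8000000 splits.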
First off, the different versions do not do the same thing:
boost::split
but it doesn't appear to be a feature for boost::tokenizer)
Yes, recompiles are inevitable with dynamic separators. But no, this is not the bottleneck (the other approaches have dynamic separators too).
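One concrete way splitters can disagree is how a multi-character separator is interpreted. The grammar in the question matches `sep` as one literal string, whereas `boost::is_any_of` (typically used with `boost::split`) and `boost::char_separator` (used with `boost::tokenizer`) treat their argument as a set of single-character delimiters. The sketch below shows the two interpretations using only the standard library; `SplitOnLiteral` and `SplitOnAnyOf` are hypothetical helper names, not part of Boost or of the benchmark.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Interpretation 1: the whole separator must match as one literal string.
inline std::vector<std::string> SplitOnLiteral(const std::string& buffer, const std::string& sep)
{
    std::vector<std::string> out;
    if (sep.empty()) { // nothing to split on: return the buffer as one token
        out.push_back(buffer);
        return out;
    }
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = buffer.find(sep, start);
        if (pos == std::string::npos) {
            out.push_back(buffer.substr(start));
            return out;
        }
        out.push_back(buffer.substr(start, pos - start));
        start = pos + sep.size();
    }
}

// Interpretation 2: every character of `chars` is a delimiter on its own.
inline std::vector<std::string> SplitOnAnyOf(const std::string& buffer, const std::string& chars)
{
    std::vector<std::string> out;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = buffer.find_first_of(chars, start);
        if (pos == std::string::npos) {
            out.push_back(buffer.substr(start));
            return out;
        }
        out.push_back(buffer.substr(start, pos - start));
        start = pos + 1;
    }
}
```

For the input `a<>b>c` with separator `<>`, the literal version yields `a` and `b>c`, while the charset version yields `a`, an empty token, `b` and `c`. With the single-character separator `|` used in the benchmark both interpretations agree, so the timings alone do not reveal the difference.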
I've done some optimizations. The timings:
Coliru clang++:
Coliru g++:
Local system g++:
As you can see, the Spirit approach doesn't need to be slower. What steps did I take? http://paste.ubuntu.com/11001344/
- (... no_case[char_(delimiter)] if required) 2.742μs.
- ... value subrule (reduced copying and dynamic dispatch because of type-erased non-terminal rule) 2.579μs.
- Made delimiter a charset instead of a string literal: 2.693μs.
- Using qi::raw[] instead of std::string synthesized attributes (avoid copying!) 0.624072μs.
- (... spirit_direct implementation) rate: 0.491011μs.

Now it seems fairly obvious that all the implementations would benefit from not "compiling" the separator each time. I didn't do it for all the approaches, but for fun let's do it for the Spirit version:
Full listing:
Live On Coliru
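For readers without Boost at hand, the two biggest wins listed above (iterator ranges instead of copied std::string tokens, and preparing the separator once instead of on every call) can be sketched with the standard library alone. This is an analogy, not the Spirit implementation from the answer: `CharsetSplitter` and `Range` are hypothetical names, `Range` plays the role of the iterator range that qi::raw[] exposes, and the delimiter is treated as a charset.

```cpp
#include <array>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Iter  = std::string::const_iterator;
using Range = std::pair<Iter, Iter>; // [first, second) inside the input buffer

class CharsetSplitter
{
public:
    // The lookup table is built once here and reused by every call.
    explicit CharsetSplitter(const std::string& separators, bool isCaseSensitive = true)
    {
        isSep_.fill(false);
        for (unsigned char c : separators) {
            isSep_[c] = true;
            if (!isCaseSensitive) {
                isSep_[static_cast<unsigned char>(std::tolower(c))] = true;
                isSep_[static_cast<unsigned char>(std::toupper(c))] = true;
            }
        }
    }

    // Fills `out` with ranges pointing into `buffer`; no characters are copied.
    // Empty tokens are kept.
    void operator()(const std::string& buffer, std::vector<Range>& out) const
    {
        out.clear();
        Iter tokenBegin = buffer.begin();
        for (Iter it = buffer.begin(); it != buffer.end(); ++it) {
            if (isSep_[static_cast<unsigned char>(*it)]) {
                out.emplace_back(tokenBegin, it);
                tokenBegin = it + 1;
            }
        }
        out.emplace_back(tokenBegin, buffer.end());
    }

private:
    std::array<bool, 256> isSep_;
};
```

A splitter would be constructed once, for example `CharsetSplitter split("|");`, and then called as `split(buffer, parts)` inside the loop. The ranges stay valid only while `buffer` is alive and unmodified, which is the price of avoiding the copies.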