So we have a simple split:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;
vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
    vector<string> result;
    if (delim.empty()) {
        result.push_back(s);
        return result;
    }
    string::const_iterator substart = s.begin(), subend;
    while (true) {
        subend = search(substart, s.end(), delim.begin(), delim.end());
        string temp(substart, subend);
        if (keep_empty || !temp.empty()) {
            result.push_back(temp);
        }
        if (subend == s.end()) {
            break;
        }
        substart = subend + delim.size();
    }
    return result;
}
or Boost's split. And we have a simple main like:
int main() {
    const vector<string> words = split("close no \"\n matter\" how \n far", " ");
    copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}
How can we make it output something like
close
no
"\n matter"
how
end symbol found.
We want to teach split about structures that should be kept unsplit (such as quoted strings) and characters that should end the parsing process. How can this be done?
If your grammar contains escaped sequences, I do not believe you will be able to use simple split techniques.
You will need a state machine.
Here is some example code to give you an idea of what I mean. This solution is neither fully specified nor implied correct. I am fairly certain it has off-by-one errors that can only be found with thorough testing.
This sort of code is hard to reason about and maintain. That is what happens when people make crappy grammars, though. Tabs were designed to delimit fields; encourage their use when possible.
I would be ecstatic to upvote another more object oriented solution.
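To make the state machine idea concrete, here is a minimal sketch. It is not the code from this answer: the names split_quoted and SplitResult are hypothetical, and I am assuming that a space is the delimiter, that a double quote toggles the "keep unsplit" state, that a newline outside quotes is the end symbol, and that empty tokens are dropped.

```cpp
#include <string>
#include <vector>

// Tokens found so far, plus a flag telling the caller whether parsing
// stopped early because the end symbol was seen outside of quotes.
struct SplitResult {
    std::vector<std::string> tokens;
    bool stopped_at_end_symbol = false;
};

// Hypothetical helper: a two-state machine (outside quotes / inside quotes).
SplitResult split_quoted(const std::string& s, char delim = ' ',
                         char quote = '"', char end_symbol = '\n') {
    enum class State { Outside, InQuotes };
    State state = State::Outside;
    SplitResult out;
    std::string current;

    // Store the token collected so far; empty tokens are skipped.
    auto flush = [&]() {
        if (!current.empty()) {
            out.tokens.push_back(current);
            current.clear();
        }
    };

    for (char c : s) {
        if (state == State::InQuotes) {
            current += c;                    // delimiters and end symbols are literal here
            if (c == quote) state = State::Outside;
        } else if (c == quote) {
            current += c;                    // keep the quote as part of the token
            state = State::InQuotes;
        } else if (c == delim) {
            flush();
        } else if (c == end_symbol) {
            flush();
            out.stopped_at_end_symbol = true;
            return out;                      // ignore everything after the end symbol
        } else {
            current += c;
        }
    }
    flush();  // an unterminated quote simply runs to the end of the input
    return out;
}
```

Called on the string from the question, this should yield the four tokens close, no, "\n matter" (quotes kept, with a real newline inside) and how, with stopped_at_end_symbol set so the caller can print "end symbol found." Escaped quotes inside quotes would need a third state.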
The following code:
generates:
Based on the examples you gave, you seemed to want newlines to count as delimiters when they appear outside of quotes and be represented by the literal \n when inside of quotes, so that's what this does. It also adds the ability to have multiple delimiters, such as split_here, as I used in the test.
I wasn't sure if you wanted unmatched quotes to be split the way matched quotes are, since the example you gave has the unmatched quote separated by spaces. This code treats unmatched quotes as any other character, but it should be easy to modify if this is not the behavior you want.
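For readers who want something to experiment with, the behavior described above can be sketched roughly as follows. This is my own reconstruction under assumptions, not the code this answer refers to: split_multi is a hypothetical name, quotes are kept in the token, empty tokens are dropped, and when several delimiters match at the same position the longest one wins.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split on any of several string delimiters, keeping
// matched-quote runs intact and showing newlines inside them as literal \n.
std::vector<std::string> split_multi(const std::string& s,
                                     const std::vector<std::string>& delims) {
    std::vector<std::string> out;
    std::string current;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '"') {
            // A quote only opens a quoted run if a closing quote exists later.
            std::size_t close = s.find('"', i + 1);
            if (close != std::string::npos) {
                for (std::size_t k = i; k <= close; ++k) {
                    if (s[k] == '\n') current += "\\n";  // backslash + 'n'
                    else current += s[k];
                }
                i = close + 1;
                continue;
            }
            // Unmatched quote: fall through and treat it as any other character.
        }
        // Find the longest delimiter that starts at position i, if any.
        std::size_t matched = 0;
        for (const std::string& d : delims) {
            if (!d.empty() && d.size() > matched && s.compare(i, d.size(), d) == 0)
                matched = d.size();
        }
        if (matched > 0) {
            if (!current.empty()) {
                out.push_back(current);
                current.clear();
            }
            i += matched;
        } else {
            current += s[i];
            ++i;
        }
    }
    if (!current.empty()) out.push_back(current);
    return out;
}
```

With the delimiters " " and "\n", the string from the question should split into close, no, "\n matter" (where \n is now two visible characters), how and far. To split unmatched quotes differently, the fall-through branch is the place to change.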
The line:
will work on most, if not all, implementations of the STL, but it is not guaranteed to work. It can be replaced with the safer, but slower, version:
Updated: By way of 'thank you' for awarding the bonus, I went and implemented 4 features that I had initially skipped as "You Ain't Gonna Need It".
now supports partially quoted columns
now supports custom delimiter expressions
now supports quotes ("") inside quoted values (instead of just making them disappear)
now supports Boost ranges in addition to containers as input (e.g. char[])
As I had half expected, you were gonna need partially quoted fields (see your comment¹). Well, here you are (the bottleneck was getting it to work consistently across different versions of Boost).
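To illustrate what partially quoted columns and doubled quotes mean in practice, here is a small hand-written sketch. It deliberately does not use Boost Spirit and is not the code of this answer; split_fields is a hypothetical name, and I am assuming a single-character delimiter, that the enclosing quotes are stripped from the value, and that empty fields are kept.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split one line into fields. Quotes may open and close
// anywhere inside a field, and "" inside a quoted part is one literal quote.
std::vector<std::string> split_fields(const std::string& line, char delim = ' ') {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // doubled quote: keep one, skip the other
                    ++i;
                } else {
                    in_quotes = false;  // closing quote, not part of the value
                }
            } else {
                current += c;           // delimiters are literal inside quotes
            }
        } else if (c == '"') {
            in_quotes = true;           // a quote may open in the middle of a field
        } else if (c == delim) {
            fields.push_back(current);  // empty fields are kept
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field; an unclosed quote runs to the end
    return fields;
}
```

For example, with ',' as the delimiter, the line a,b"c,d"e,"say ""hi""" should come out as three fields: a, then bc,de, then say "hi".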
Introduction
Random notes and observations for the reader:
The splitInto template function happily supports whatever you throw at it:
    vector<string> (all lines flattened)
    vector<vector<string>> (tokens per line)
    list<list<string>> (if you prefer)
    set<set<string>> (unique linewise tokensets)
\n in the output is shown as ? for comprehension (safechars)
With +qi::lit(' ') instead of the default (' ') you will skip empty fields (i.e. repeated delimiters)
Versions required/tested
This was compiled using
It works (tested) against
The Code!
The Output
Output from the sample as shown:
Update: Output for your previously failing test case:
¹ I must admit I had a good laugh when reading that 'it crashed' [sic]. That sounds a lot like my end-users. Just to be precise: a crash is an unrecoverable application failure. What you ran into was a handled error, and was nothing more than 'unexpected behavior' from your point of view. Anyway, that's fixed now :)