I want to be able to solve problems like this: Getting std::ifstream to handle LF, CR, and CRLF? where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
- Read in the istream a character at a time
- Collect the characters
- When a delimiter is hit, return the collection as a token
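
For concreteness, the manual approach in the list above might look roughly like the sketch below. The function name split_lines is made up for illustration, and the delimiter set (LF, CR, or CRLF treated as a single delimiter) is my assumption based on the linked question's title, so adjust it to whatever delimiter you actually need.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a stream on LF, CR, or CRLF by hand.
std::vector<std::string> split_lines(std::istream& in) {
    std::vector<std::string> tokens;
    std::string current;
    char c;
    // Read the istream a character at a time.
    while (in.get(c)) {
        if (c == '\n' || c == '\r') {
            // Assumption: a CR immediately followed by LF is one delimiter.
            if (c == '\r' && in.peek() == '\n') {
                in.get();
            }
            // A delimiter was hit: return the collection as a token.
            tokens.push_back(current);
            current.clear();
        } else {
            // Collect the characters.
            current.push_back(c);
        }
    }
    // Keep trailing text that was not followed by a delimiter.
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Calling split_lines on a std::istringstream holding "A\nB\rC\r\n" should give the three tokens A, B and C. It works, but every new delimiter rule means more hand-written state, which is exactly what I would like a library to take over.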
Regexes are very good at tokenizing strings with complex delimiters:
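string foo{ "A\nB\rC\n\r" };
vector<string> bar;
// The regex is a named object: in C++14 and later a regex_iterator cannot be constructed from a temporary regex
const regex re("(.*)(?:\n\r?|\r)");
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), re), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });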
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.

Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
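
To make the slurping workaround concrete, here is a minimal sketch of it. The function name tokenize_slurped is hypothetical, and the pattern is the same one used for the string example, so it carries the same assumption about which delimiters matter.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: slurps the stream, then tokenizes the resulting string.
std::vector<std::string> tokenize_slurped(std::istream& in) {
    // Slurp: copy the rest of the stream into one string.
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    // The regex must outlive the iterators, so it is a named object.
    const std::regex delimited("(.*)(?:\n\r?|\r)");

    std::vector<std::string> tokens;
    std::transform(std::sregex_iterator(text.cbegin(), text.cend(), delimited),
                   std::sregex_iterator(),
                   std::back_inserter(tokens),
                   [](const std::smatch& m) { return m[1].str(); });
    return tokens;
}
```

The whole input ends up in text before the first token is produced, which is the step that feels superfluous. Note that, as with the string version, any trailing text not followed by a delimiter is not returned.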
This question is about code appearance:
- Since a regex will work 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time rather than internally reading and parsing the istream 1 character at a time
- Since reading an istream 1 character at a time will still copy that one character to a temp variable (buffer), this code seeks to avoid buffering everything internally, depending on a library to abstract that instead

C++11's
regexes use ECMA-262, which does not support look behinds (it does support look aheads): https://stackoverflow.com/a/14539500/2642059 This means that a regex which avoids look aheads could, in principle, match using only an input_iterator_tag, but clearly those implemented in C++11 do not.

boost::regex_iterator
on the other hand does support the boost::match_partial flag (which is not available in C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.

boost::match_partial
has 4 drawbacks, including:
- In the worst case, like "ABC\n", this saves the user no size and he must slurp the whole istream
- boost always causes bloat

Circling back to answer the question: A standard library
regex_iterator cannot operate on input_iterator_tag iterators; slurping of the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping of the whole file anyway, it is not a good answer to this question.

For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.

I think not.
istream_iterator has the input_iterator_tag tag, whereas regex_iterator expects to be initialized using bi-directional iterators (bidirectional_iterator_tag).

If your delimiter regex is complex enough that you want to avoid parsing the stream yourself, the best way to do this is indeed to slurp the istream.
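
If you want to see the category mismatch for yourself, a small compile-time check along these lines should do it. The alias names and the is_bidirectional helper are mine, purely for illustration; the traits and tags are the standard ones.

```cpp
#include <istream>
#include <iterator>
#include <string>
#include <type_traits>

// Iterator categories of the two kinds of iterator being compared.
using stream_category =
    std::iterator_traits<std::istream_iterator<char>>::iterator_category;
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;

// Hypothetical helper: true when Category is bidirectional_iterator_tag
// or a refinement of it (such as random_access_iterator_tag).
template <typename Category>
constexpr bool is_bidirectional() {
    return std::is_base_of<std::bidirectional_iterator_tag, Category>::value;
}

// istream_iterator is a single-pass input iterator...
static_assert(std::is_same<stream_category, std::input_iterator_tag>::value,
              "istream_iterator has input_iterator_tag");
static_assert(!is_bidirectional<stream_category>(),
              "istream_iterator is not bidirectional");

// ...while the iterators of a slurped string satisfy what regex_iterator expects.
static_assert(is_bidirectional<string_category>(),
              "string iterators are at least bidirectional");
```

This only shows the iterator categories; it says nothing about how a particular standard library implements its regex matcher.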