I want to efficiently parse large CSV-like files whose column order I only learn at runtime. With Spirit Qi, I would parse each field with a `lazy` auxiliary parser that selects at runtime which column-specific parser to apply to each column. But X3 doesn't seem to have `lazy` (even though it's listed in the documentation). After reading recommendations here on SO, I decided to write a custom parser.
It ended up being pretty nice, but now I've noticed I don't really need the `pos` variable to be exposed anywhere outside the custom parser itself. I tried moving it into the parser and started getting compiler errors stating that the `column_value_parser` object is read-only. Can I somehow put `pos` into the parser structure?
Simplified code that gets the compile-time error, with commented out parts of my working version:
#include <iostream>
#include <variant>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // size_t& pos;
    size_t pos;

    // column_value_parser(std::vector<column_variant>& columns, size_t& pos)
    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    // , pos(pos)
        , pos(0)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& ctx, Other const& other, Attr& attr) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text& c) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer& c) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real& c) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};

int main(int argc, char *argv[])
{
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    // Comes from external source.
    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    size_t pos = 0;

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        // (column_value_parser(columns, pos) % ',') % boost::spirit::x3::eol);
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}
XY: My goal is to parse ~500 GB of pseudo-CSV files in a reasonable time on a machine with little RAM, convert them into a list of (roughly) [row-number, column-name, value], then put that into storage. The format is actually a little more complex than CSV: database dumps formatted in a… human-friendly way, with column values that are actually several small sublanguages (e.g. dates or, uh, something similar to whole Apache log lines stuffed into a single field), and I'm often extracting only one specific part of each column. Different files may have different columns in a different order, which I can only learn by parsing yet another set of files containing the original queries. Thankfully, Spirit makes it a breeze…
Three answers:
1. Make `pos` a `mutable` member
2. `x3::with<>`
3. Functional composition
1. Making `pos` mutable
Live On Wandbox
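The compiler error in the question comes from `parse` being a `const` member function that assigns to `pos`. A minimal sketch of how `mutable` lifts that restriction, in plain standard C++ rather than X3: `column_cursor` and `next()` are hypothetical stand-ins for `column_value_parser` and its `parse`, and only the position handling is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for column_value_parser; only the position logic is kept.
struct column_cursor {
    std::vector<std::string> const& columns;
    mutable std::size_t pos = 0;  // `mutable`: may be modified through a const object

    explicit column_cursor(std::vector<std::string> const& cols) : columns(cols) {}

    // const, like the parse() member in the question. Without `mutable` on pos,
    // the assignment below is rejected because *this is read-only here.
    std::string const& next() const {
        std::string const& current = columns[pos];
        pos = (pos + 1) % columns.size();  // wrap around at the end of a row
        return current;
    }
};
```

Applied to the question's code, the same change would be declaring the member as `mutable size_t pos;` and leaving `parse` as it is.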
2. `x3::with<>`
This is similar but with better (re)entrancy and encapsulation:
Live On Wandbox
3. Functional Composition
Because it's so much easier in X3, my favourite is to just generate the parser on demand.
Without requirements, this is the simplest I'd propose:
Live On Wandbox
A version with debug information enabled:
Live On Wandbox
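The idea of generating the parser on demand can be illustrated without X3. The following is only a sketch in plain standard C++, not the code from the linked demos: `column_kind`, `make_field_parser` and `make_row_parser` are hypothetical names, `std::function` is used for type erasure, and only text, integer and skipped columns are handled. The point it shows is that once the row parser is built from the runtime column list, the column position becomes a local variable of each call, so no mutable parser state is needed.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

enum class column_kind { text, integer, skip };  // hypothetical column spec

using field_parser = std::function<bool(std::string_view, std::vector<std::string>&)>;
using row_parser   = std::function<bool(std::string_view, std::vector<std::string>&)>;

// Pick the handler for one column once, when the row parser is built.
field_parser make_field_parser(column_kind kind) {
    switch (kind) {
    case column_kind::text:
        return [](std::string_view field, std::vector<std::string>& out) {
            out.push_back("Text: " + std::string(field));
            return true;
        };
    case column_kind::integer:
        return [](std::string_view field, std::vector<std::string>& out) {
            int value = 0;
            char const* first = field.data();
            char const* last = first + field.size();
            auto [ptr, ec] = std::from_chars(first, last, value);
            if (ec != std::errc{} || ptr != last) return false;  // not a whole integer
            out.push_back("Integer: " + std::to_string(value));
            return true;
        };
    case column_kind::skip:
        break;
    }
    return [](std::string_view, std::vector<std::string>&) { return true; };  // ignore field
}

// Compose the per-column handlers into a parser for one comma-separated row.
row_parser make_row_parser(std::vector<column_kind> const& columns) {
    std::vector<field_parser> fields;
    for (column_kind kind : columns) fields.push_back(make_field_parser(kind));

    return [fields = std::move(fields)](std::string_view line, std::vector<std::string>& out) {
        std::size_t pos = 0;  // local to this call: nothing to reset or roll back
        for (;;) {
            if (pos == fields.size()) return false;  // more fields than columns
            std::size_t const comma = line.find(',');
            if (!fields[pos](line.substr(0, comma), out)) return false;
            ++pos;
            if (comma == std::string_view::npos) break;
            line.remove_prefix(comma + 1);
        }
        return pos == fields.size();  // fewer fields than columns is a failure
    };
}
```

A caller would build the row parser once per file, e.g. `auto parse_row = make_row_parser(columns);`, and then apply it to every line.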
Notes, Caveats:
With anything `mutable`, beware of side-effects. E.g. if you have `a | b` and `a` includes `column_value_parser`, the side-effect of incrementing `pos` will not be rolled back when `a` fails and `b` is matched instead. In short, this makes your parse function impure.
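The caveat can be reproduced with a small hand-rolled alternative in plain standard C++. This is a sketch with hypothetical names (`counting_digit`, `parse_a`, `parse_b`, `parse_a_or_b`), not X3 code: branch `a` restores the input when it fails, as an alternative parser does, but nothing restores the counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical parser with a mutable counter, mimicking `pos` in column_value_parser.
struct counting_digit {
    mutable std::size_t pos = 0;

    bool parse(std::string_view& in) const {
        if (in.empty() || in.front() < '0' || in.front() > '9') return false;
        in.remove_prefix(1);
        ++pos;  // side effect that outlives a later failure of the enclosing branch
        return true;
    }
};

// a: a counted digit followed by '!'. The input is restored on failure, the counter is not.
bool parse_a(counting_digit const& digit, std::string_view& in) {
    std::string_view const saved = in;
    if (digit.parse(in) && !in.empty() && in.front() == '!') {
        in.remove_prefix(1);
        return true;
    }
    in = saved;  // backtrack the input only
    return false;
}

// b: a digit followed by '?', without touching any counter.
bool parse_b(std::string_view& in) {
    if (in.size() >= 2 && in[0] >= '0' && in[0] <= '9' && in[1] == '?') {
        in.remove_prefix(2);
        return true;
    }
    return false;
}

// a | b
bool parse_a_or_b(counting_digit const& digit, std::string_view& in) {
    return parse_a(digit, in) || parse_b(in);
}
```

With the input `7?`, branch `a` consumes the digit, fails on `?` and backtracks; branch `b` then matches, yet the counter has already been incremented by the failed attempt.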