I try to parse TPCH files with Boost Spirit QI. My implementation inspired by the employee example of Spirit QI ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp ). The data is in csv format and the tokens are delimited with a '|' character.
It works but it is very slow (20 sec. for 1 GB).
Here is my qi grammer for the lineitem file:
struct lineitem {
int l_orderkey;
int l_partkey;
int l_suppkey;
int l_linenumber;
std::string l_quantity;
std::string l_extendedprice;
std::string l_discount;
std::string l_tax;
std::string l_returnflag;
std::string l_linestatus;
std::string l_shipdate;
std::string l_commitdate;
std::string l_recepitdate;
std::string l_shipinstruct;
std::string l_shipmode;
std::string l_comment;
};
BOOST_FUSION_ADAPT_STRUCT( lineitem,
(int, l_orderkey)
(int, l_partkey)
(int, l_suppkey)
(int, l_linenumber)
(std::string, l_quantity)
(std::string, l_extendedprice)
(std::string, l_discount)
(std::string, l_tax)
(std::string, l_returnflag)
(std::string, l_linestatus)
(std::string, l_shipdate)
(std::string, l_commitdate)
(std::string, l_recepitdate)
(std::string, l_shipinstruct)
(std::string, l_shipmode)
(std::string, l_comment))
vector<lineitem>* lineitems=new vector<lineitem>();
phrase_parse(state->dataPointer,
state->dataEndPointer,
(*(int_ >> "|" >>
int_ >> "|" >>
int_ >> "|" >>
int_ >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> "|" >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|' >>
+(char_ - '|') >> '|'
) ), space, *lineitems
);
The problem seems to be the character parsing. It is much slower than other conversions. Is there a better way to parse variable length tokens into strings?
Are you using -O2 when compiling?
Boosts libraries have a lot of redundancy that are removed when using optimization flags.
Another possible solution is use Repetition Parser Directive: http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/qi/reference/directive/repeat.html
I found a solution to my problem. As described in this post Boost Spirit QI grammar slow for parsing delimited strings the performance bottleneck is the string handling of Spirit qi. All other data types seem to be quite fast.
I avoid this problem through doing the handling of the data on my own instead of using the Spirit qi handling.
My solution uses a helper class which offers functions for every field of the csv file. The functions store the values into a struct. Strings are stored in a char[]s. Hits the parser a newline character it calls a function which adds the struct to the result vector. The Boost parser calls this functions instead of storing the values into a vector on its own.
Here is my code for the region.tbl file of the TCPH Benchmark:
It is not a pretty solution but it works and it is much faster. ( 2.2 secs for 1 GB TCPH data, multithreaded)
The problem mainly comes from appending individual
char
elements tostd::string
container. According to your grammar, for eachstd::string
attribute the allocation starts when a char is met and stops when you find a|
separator. So, at first there aresizeof(char)+1
reserved bytes (null-terminated "\0"). The compiler will have to run the allocator ofstd::string
depending on the allocators doubling algorithm! That means the memory has to be re-allocated very frequently for small strings. This means your string is copied to a memory allocation double it's size and the previous allocation is freed, at intervals of 1,2,4,6,12,24... characters. No wonder it was slow, this causes huge problems with the frequent malloc calls; more heap fragmentation, a bigger linked list of free memory blocks, variable (small) sizes of those memory blocks which at it's turn causes issues with longer scanning of memory for the application's allocations during it's entire lifetime. tldr; the data becomes fragmented and widely dispersed in the memory.Proof? The following code is called by the
char_parser
each time a valid character is met in your Iterator. From Boost 1.54/boost/spirit/home/qi/char/char_parser.hpp
/boost/spirit/home/qi/detail/assign_to.hpp
/boost/spirit/home/support/container.hpp
The correction follow-up code you posted (changing your struct to char
Name[Size]
) is basically the same as adding a stringName.reserve(Size)
statement directive. However, there's no directive for this at the moment.The Solution:
/boost/spirit/home/support/container.hpp
/boost/spirit/home/qi/char/char_parser.hpp
I haven't tested it but I assume it can speed up char parsers over string attributes by over 10x like you saw. It would be a great optimization feature in a Boost Spirit update, including a
reserve(initial_size)[ +( char_ - lit("|") ) ]
directive that sets the initial buffer size.