I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.
My current parsing takes the stream (called file) and does the following:
float x,y,z;
file >> x >> y >> z;
Someone on Stack Overflow recommended using Boost.Spirit, but I couldn't find any simple tutorial that explains how to use it.
I'm trying to find a simple and efficient way to parse a line that looks like this:
"134.32 3545.87 3425"
I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.
I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.
Thanks in advance.
Summary:
Spirit parsers are fastest. If you can use C++14 consider the experimental version Spirit X3:
The above was measured using memory mapped files. Using IOstreams, it will be slower across the board, but not as slow as scanf using C/POSIX FILE* function calls:
What follows is parts from the OLD answer
Environment:
Full Code
Full code for the old benchmark is in the edit history of this post; the newest version is on GitHub.
A nitty-gritty solution would be to throw more cores at the problem by spawning multiple threads. If the bottleneck is just the CPU, you can roughly halve the running time by spawning two threads (on multicore CPUs).
Some other tips:
Try to avoid parsing functions from libraries such as Boost and/or std. They are loaded with error-checking conditions, and much of the processing time is spent doing these checks. For just a couple of conversions they are fine, but they fail miserably when it comes to processing millions of values. If you already know that your data is well-formatted, you can write (or find) a custom optimized C function which does only the data conversion.
Use a large memory buffer (let's say 10 MB) into which you load chunks of your file, and do the conversion there.
Divide et impera: split your problem into smaller, easier ones. Preprocess your file so that each line holds a single float, split each line at the "." character and convert integers instead of floats, then merge the two integers to create the floating-point number.
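The chunked-buffer tip above can be sketched roughly as follows. This is only an illustration under assumptions: the names Point, parse_points and read_points_chunked are hypothetical, the input is assumed to be well-formed "x y z" lines, and std::strtof stands in for whatever conversion routine you end up using.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

struct Point { float x, y, z; };  // hypothetical record type for one line

// Parses whitespace-separated floats from a NUL-terminated buffer, three per
// point. Stops at the first position where no number can be converted.
static void parse_points(const char* p, std::vector<Point>& out)
{
    char* end = nullptr;
    for (;;) {
        float v[3];
        for (int i = 0; i < 3; ++i) {
            v[i] = std::strtof(p, &end);
            if (end == p) return;  // nothing left, or malformed data
            p = end;
        }
        out.push_back({v[0], v[1], v[2]});
    }
}

// Reads `in` in chunks of `chunk_size` bytes. Only complete lines are parsed
// after each read; a partial last line is kept and completed by the next chunk.
std::vector<Point> read_points_chunked(std::FILE* in, std::size_t chunk_size)
{
    assert(in != nullptr && chunk_size > 0);
    std::vector<Point> points;
    std::vector<char> chunk(chunk_size);
    std::string pending;  // unparsed text carried over between chunks
    std::size_t n;
    while ((n = std::fread(chunk.data(), 1, chunk.size(), in)) > 0) {
        pending.append(chunk.data(), n);
        std::size_t last_nl = pending.rfind('\n');
        if (last_nl == std::string::npos) continue;  // no complete line yet
        pending[last_nl] = '\0';                     // terminate the complete part
        parse_points(pending.c_str(), points);
        pending.erase(0, last_nl + 1);               // keep only the partial line
    }
    parse_points(pending.c_str(), points);  // last line without a trailing newline
    return points;
}
```

A call such as read_points_chunked(fp, 10 * 1024 * 1024) would correspond to the 10 MB buffer mentioned in the tip. Appending each chunk to a std::string costs one extra copy, which a tuned version would probably avoid.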
Using C is going to be the fastest solution. Split into tokens using strtok and then convert to float with strtof. Or, if you know the exact format, use fscanf.
I would check out this related post, Using ifstream to read floats, or How do I tokenize a string in C++, particularly the posts related to the C++ String Toolkit Library. I've used C strtok, C++ streams, and Boost tokenizer, and the best of them for ease of use is the C++ String Toolkit Library.
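Since the question asks how to get from tokens to floats, here is a minimal sketch of the strtok plus strtof approach mentioned above. The helper name parse_line is hypothetical, and the sketch assumes each line is available as a writable, NUL-terminated buffer (strtok modifies its input and is not thread-safe).

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: splits `line` in place on whitespace and converts up to
// three tokens to float. Returns the number of values actually converted.
int parse_line(char* line, float out[3])
{
    assert(line != nullptr && out != nullptr);
    const char* delims = " \t\r\n";
    int count = 0;
    for (char* tok = std::strtok(line, delims);
         tok != nullptr && count < 3;
         tok = std::strtok(nullptr, delims)) {
        char* end = nullptr;
        float v = std::strtof(tok, &end);
        if (end == tok) break;  // token does not start with a number
        out[count++] = v;
    }
    return count;
}
```

The return value lets the caller detect short or malformed lines. Note that strtof alone would also work without tokenizing first, because it skips leading white space and reports where it stopped.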
If the conversion is the bottleneck (which is quite possible), you should start by using the different possibilities in the standard. Logically, one would expect them to be very close, but practically, they aren't always:
You've already determined that std::ifstream is too slow.
Converting your memory mapped data to an std::istringstream is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.
Writing your own streambuf to read directly from the memory, without copying (or using the deprecated std::istrstream) might be a solution, although if the problem really is the conversion... this still uses the same conversion routines.
You can always try fscanf, or scanf on your memory mapped stream. Depending on the implementation, they might be faster than the various istream implementations.
Probably faster than any of these is to use strtod. No need to tokenize for this: strtod skips leading white space (including '\n'), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky, your loop should probably look a bit like:
If none of these are fast enough, you'll have to consider the actual data. It probably has some sort of additional constraints, which means that you can potentially write a conversion routine which is faster than the more general ones; e.g. strtod has to handle both fixed and scientific, and it has to be 100% accurate even if there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of input, is non-trivial; you really do have to know what you are doing.
EDIT:
Just out of curiosity, I've run some tests. In addition to the aforementioned solutions, I wrote a simple custom converter, which only handles fixed point (no scientific), with at most five digits after the decimal, and the value before the decimal must fit in an int:
(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)
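The original listing is not preserved in this copy. As a rough sketch only, a converter with the restrictions described above, together with a reading loop of the kind suggested earlier for strtod, might look like the following. The names simple_strtod and parse_all are hypothetical, the buffer is assumed to be NUL-terminated, and overflow and excess fractional digits are deliberately left unhandled.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical restricted converter: fixed point only, at most five digits
// after the decimal point, integer part must fit in an int. On failure,
// *end is set to str and 0.0 is returned, as strtod does.
double simple_strtod(const char* str, char** end)
{
    assert(str != nullptr && end != nullptr);
    static const double divisor[] = {1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0};
    const char* p = str;
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;

    bool negative = false;
    if (*p == '-' || *p == '+') { negative = (*p == '-'); ++p; }

    bool any_digits = false;
    int whole = 0;  // no overflow check in this sketch
    while (*p >= '0' && *p <= '9') { whole = whole * 10 + (*p - '0'); ++p; any_digits = true; }

    int frac = 0;
    int frac_digits = 0;
    if (*p == '.') {
        ++p;
        while (*p >= '0' && *p <= '9' && frac_digits < 5) {
            frac = frac * 10 + (*p - '0');
            ++frac_digits; ++p; any_digits = true;
        }
    }
    if (!any_digits) { *end = const_cast<char*>(str); return 0.0; }

    // Both operands are exact in a double, so a single division does the rounding.
    double value = (static_cast<double>(whole) * divisor[frac_digits] + frac)
                   / divisor[frac_digits];
    *end = const_cast<char*>(p);
    return negative ? -value : value;
}

// Reading loop: convert until the converter makes no progress, which covers
// both end of input and unparsable text.
template <typename Conv>
std::vector<double> parse_all(const char* text, Conv conv)
{
    std::vector<double> values;
    char* end = nullptr;
    for (const char* p = text; ; p = end) {
        double v = conv(p, &end);
        if (end == p) break;
        values.push_back(v);
    }
    return values;
}
```

Because the converter takes the same arguments as strtod, parse_all(text, std::strtod) and parse_all(text, simple_strtod) can be swapped when timing the two.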
The interface is exactly that of strtod, to simplify coding.
I ran the benchmarks in two environments (on different machines, so the absolute values of any times aren't relevant). I got the following results:
Under Windows 7, compiled with VC 11 (/O2):
Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):
In all cases, I'm reading 554000 lines, each with 3 randomly generated floating point values in the range [0...10000).
The most striking thing is the enormous difference between fstream and fscanf under Windows (and the relatively small difference between fscanf and strtod). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things that the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.
Before you start, verify that this is the slow part of your application and get a test harness around it so you can measure improvements.
boost::spirit would be overkill for this in my opinion. Try fscanf.
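A minimal sketch of the fscanf suggestion, assuming the file has already been opened as a FILE* and holds nothing but "x y z" triples. The names Vec3 and read_with_fscanf are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical record type for one line

// Reads triples until fscanf can no longer match three floats, which happens
// at end of file or at the first malformed line.
std::vector<Vec3> read_with_fscanf(std::FILE* in)
{
    assert(in != nullptr);
    std::vector<Vec3> result;
    Vec3 v;
    while (std::fscanf(in, "%f %f %f", &v.x, &v.y, &v.z) == 3)
        result.push_back(v);
    return result;
}
```

Checking the return value against 3 is what ends the loop; whether this is fast enough is something to measure with the test harness recommended above.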