I'm writing a C++14 program that loads text strings from a file, does some computation on them, and writes the results to another file. I'm on Linux, and the files are relatively large (O(10^6 lines)). My usual approach is to use the old C getline and sscanf utilities to read and parse the input, and fprintf(FILE*, ...) to write the output files. This works, but I'm wondering whether there's a better way, with the goals of high performance and the approach generally recommended for the modern C++ standard I'm using. I've heard that iostream is quite slow; if that's true, I'd like to know what the recommended alternative is.
Update: To clarify the use case a bit: for each line of the input file, I'll be doing some text manipulation (data cleanup, etc.). Each line is independent, so loading the entire input file (or at least large chunks of it), processing it line by line, and then writing it out seems to make the most sense. The ideal abstraction would be an iterator over the read-in buffer, with each line being an entry. Is there a recommended way to do that with std::ifstream?
The fastest option, if you have the memory to do it, is to read the entire file into a buffer with 1 read, process the buffer in memory, and write it all out again with 1 write.
Read it all:
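A minimal C++14 sketch of that read step; the function name, the hard-coded path parameter, and the lack of error checking are my own simplifications:

```cpp
#include <fstream>
#include <string>

// Read an entire file into a string with one read() call.
// (Sketch: error handling omitted.)
std::string read_whole_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);                    // seek to the end to learn the size
    const std::streamsize size = in.tellg();
    in.seekg(0, std::ios::beg);                    // rewind to the start

    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], size);                     // one big read into the buffer
    return buffer;
}
```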
Then process it:
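For the per-line cleanup described in the question, one straightforward option is to split the buffer into lines and iterate over those. This is only a sketch (the helper name is mine); in C++14 it copies each line, whereas with C++17 you could collect std::string_view instead to avoid the copies:

```cpp
#include <string>
#include <vector>

// Split the in-memory buffer into lines so each one can be processed
// independently. Lines are delimited by '\n'; a missing final newline is handled.
std::vector<std::string> split_lines(const std::string& buffer)
{
    std::vector<std::string> lines;
    std::size_t begin = 0;
    while (begin < buffer.size()) {
        std::size_t end = buffer.find('\n', begin);
        if (end == std::string::npos)
            end = buffer.size();                   // last line without trailing '\n'
        lines.emplace_back(buffer, begin, end - begin);
        begin = end + 1;
    }
    return lines;
}
```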
Then write it all:
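And the matching write step, again as a bare sketch with error handling omitted:

```cpp
#include <fstream>
#include <string>

// Write the processed buffer back out with one write() call.
void write_whole_file(const char* path, const std::string& buffer)
{
    std::ofstream out(path, std::ios::binary);
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
}
```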
If you have C++17 (std::filesystem), there is also this way, which gets the file's size through std::filesystem::file_size instead of seekg and tellg; I presume this would allow you to avoid reading the file twice.
It's shown in this answer.
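A sketch of what that C++17 variant might look like; the function name is mine and error handling is again omitted:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// C++17 variant: ask the filesystem for the size up front instead of
// seeking to the end of the stream and back.
std::string read_whole_file_cpp17(const std::filesystem::path& path)
{
    const auto size = std::filesystem::file_size(path);

    std::ifstream in(path, std::ios::binary);
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(size));
    return buffer;
}
```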
I think you could read the file in parallel by creating n threads that each have their own offset, using david's method, and then pulling the data into separate areas which you then map to a single location. Check out ROMIO for ideas on how to maximize speed; the ROMIO ideas could be done in standard C++ without much trouble.
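A rough sketch of that idea in standard C++ with std::thread: each thread opens its own stream, seeks to its offset, and fills a disjoint slice of one preallocated buffer. The function name and the chunking scheme are my own assumptions, and note that chunk boundaries will usually fall mid-line, so per-line processing still has to stitch lines across chunk edges:

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Fill one preallocated buffer from n threads, each reading its own chunk
// through its own ifstream. (Sketch: error handling omitted.)
std::string parallel_read(const std::string& path, unsigned n_threads)
{
    if (n_threads == 0) n_threads = 1;

    // Determine the file size with seekg/tellg, as above.
    std::ifstream probe(path, std::ios::binary);
    probe.seekg(0, std::ios::end);
    const std::streamoff file_end = probe.tellg();
    const std::size_t size = static_cast<std::size_t>(file_end);

    std::string buffer(size, '\0');
    const std::size_t chunk = (size + n_threads - 1) / n_threads;
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < n_threads; ++i) {
        const std::size_t begin = i * chunk;
        if (begin >= size) break;
        const std::size_t len = std::min(chunk, size - begin);

        // Each worker writes only to its own [begin, begin + len) slice,
        // so no synchronization is needed on the buffer.
        workers.emplace_back([&buffer, &path, begin, len] {
            std::ifstream in(path, std::ios::binary);   // one stream per thread
            in.seekg(static_cast<std::streamoff>(begin));
            in.read(&buffer[begin], static_cast<std::streamsize>(len));
        });
    }
    for (auto& t : workers) t.join();
    return buffer;
}
```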