C++ High Performance File Reading and Writing

Posted 2020-06-23 07:12

I’m writing a C++14 program that loads text strings from a file, does some computation on them, and writes the results to another file. I’m on Linux, and the files are relatively large (on the order of 10^6 lines). My usual approach is to use the old C-style getline and sscanf utilities to read and parse the input, and fprintf(FILE*, …) to write the output files. This works, but I’m wondering whether there is a better way, both in terms of performance and in terms of what is generally recommended for the modern C++ standard I’m using. I’ve heard that iostream is quite slow; if that’s true, is there a more recommended approach?

Update: To clarify a bit on the use case: for each line of the input file, I'll be doing some text manipulation (data cleanup, etc.). Each line is independent. So, loading the entire input file (or, at least large chunks of it), and processing it line by line, and then writing it, seems to make the most sense. The ideal abstraction for this would be to get an iterator to the read-in buffer, with each line being an entry. Is there a recommended way to do that with std::ifstream?

Tags: c++ io
3 answers
Rolldiameter
#2 · 2020-06-23 07:50

The fastest option, if you have the memory for it, is to read the entire file into a buffer with a single read, process the buffer in memory, and write it all out again with a single write.

Read it all:

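#include <fstream>
#include <string>

std::string buffer;

// Binary mode, so the size reported by tellg() matches the bytes read.
std::ifstream f("file.txt", std::ios::binary);
f.seekg(0, std::ios::end);
buffer.resize(static_cast<std::size_t>(f.tellg()));  // file size in bytes
f.seekg(0);
// &buffer[0] rather than buffer.data(): data() is const-only before C++17.
f.read(&buffer[0], buffer.size());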

Then process it
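For the line-by-line processing the question asks about, one option is to index the lines inside the buffer rather than copy each one out. The following is only a sketch that compiles as C++14: `index_lines` and `clean_lines` are hypothetical helper names, and the "cleanup" shown (stripping trailing spaces, tabs, and carriage returns) is just a stand-in for whatever per-line work you actually need.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, length) for each line in `buffer`,
// split on '\n'. A final line without a trailing newline is included.
std::vector<std::pair<std::size_t, std::size_t>>
index_lines(const std::string& buffer) {
    std::vector<std::pair<std::size_t, std::size_t>> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();
        lines.emplace_back(start, end - start);
        start = end + 1;  // skip past the newline
    }
    return lines;
}

// Hypothetical per-line cleanup: drops trailing spaces, tabs and '\r'
// from every line and builds the output buffer in one pass.
std::string clean_lines(const std::string& buffer) {
    std::string out;
    out.reserve(buffer.size());  // output is never larger than input + 1
    for (const auto& line : index_lines(buffer)) {
        std::size_t len = line.second;
        while (len > 0) {
            const char c = buffer[line.first + len - 1];
            if (c != ' ' && c != '\t' && c != '\r') break;
            --len;
        }
        out.append(buffer, line.first, len);
        out.push_back('\n');
    }
    return out;
}
```

Because every line is independent, the index returned by `index_lines` could also be split into ranges and handed to several worker threads if the per-line work turns out to be the bottleneck.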

Then write it all:

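// Distinct stream name; "out.txt" is a placeholder for the separate output file.
std::ofstream out("out.txt", std::ios::binary);
out.write(buffer.data(), buffer.size());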
ゆ 、 Hurt°
#3 · 2020-06-23 08:02

If you have C++17 (std::filesystem), there is also this way, which gets the file's size through std::filesystem::file_size instead of seekg and tellg. I presume this lets you size the buffer without first seeking to the end of the stream and back.

It's shown in this answer
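The linked answer is not reproduced here, so here is a minimal sketch of what the idea might look like. It requires C++17 (not the C++14 from the question), and `read_whole_file` is a hypothetical helper name, not something from the linked answer.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (C++17): reads a whole file into a string, sizing the
// buffer from file metadata instead of seekg/tellg.
std::string read_whole_file(const std::filesystem::path& path) {
    // file_size throws std::filesystem::filesystem_error on failure.
    const auto size = std::filesystem::file_size(path);
    std::string buffer(static_cast<std::size_t>(size), '\0');

    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path.string());

    // Non-const std::string::data() is available from C++17 on.
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    // Trim in case the file shrank between file_size() and read().
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Whether this is measurably faster than the seekg/tellg version is something you would have to benchmark; the main gain is probably that the size query is explicit and reports errors through an exception.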

冷血范
#4 · 2020-06-23 08:09

I think you could read the file in parallel by creating n threads, each with its own offset into the file, using david's method, and then pull the data into separate areas which you then map to a single location. Check out ROMIO for ideas on how to maximize speed; those ideas could be implemented in standard C++ without much trouble.
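A rough sketch of the threaded idea, in plain C++14 with std::thread, might look like the following. It is not ROMIO and not necessarily what "david's method" refers to: `parallel_read` is a hypothetical helper, each worker simply opens its own stream, seeks to its own offset, and fills a disjoint slice of one shared buffer. Whether this beats a single large read depends heavily on the storage device and the page cache, so treat it as something to measure rather than a guaranteed win.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: reads `path` into one buffer using up to
// `num_threads` workers, each with its own stream and byte offset.
std::string parallel_read(const std::string& path, std::size_t num_threads) {
    // Open at the end so tellg() reports the file size.
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    if (!probe || num_threads == 0) return {};
    const std::size_t size = static_cast<std::size_t>(probe.tellg());

    std::string buffer(size, '\0');
    char* const base = &buffer[0];
    const std::size_t chunk = (size + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_threads; ++i) {
        const std::size_t offset = i * chunk;
        if (offset >= size) break;
        const std::size_t len = std::min(chunk, size - offset);
        // Each worker writes only to its own slice, so no locking is needed.
        workers.emplace_back([base, &path, offset, len] {
            std::ifstream in(path, std::ios::binary);
            in.seekg(static_cast<std::streamoff>(offset));
            in.read(base + offset, static_cast<std::streamsize>(len));
        });
    }
    for (auto& w : workers) w.join();
    return buffer;
}
```

Note that the chunk boundaries fall at arbitrary byte offsets, so a line can straddle two chunks. Since everything ends up in one contiguous buffer, line splitting is best done after all workers have joined.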
