I am writing code to read several GB of data spread across multiple files using C++ IOStreams, which I've chosen over the C API for a number of design reasons that I won't bore you with. Since the data is produced by a separate program on the same machine where my code will run, I am confident that issues such as endianness can, for the most part, be ignored.
The files have a reasonably complicated structure. For example, there is a header that gives the number of records of a particular binary configuration. Later in the file, my code must conditionally read that number of records. This sort of pattern is repeated in a complicated, but well-documented way.
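To make the pattern concrete, here is a minimal sketch of the kind of header-driven read I mean. The `Header` and `Record` layouts, the `kPlainRecord` tag and the helper names are all made up for illustration; my real format is more involved. It also assumes the writer dumped the structs with the same compiler and padding, which should hold here since both programs run on the same machine.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical layout, for illustration only.
struct Header {
    std::uint32_t record_kind;   // which binary configuration follows
    std::uint32_t record_count;  // how many records of that kind follow
};

struct Record {
    std::int32_t id;
    double value;
};

const std::uint32_t kPlainRecord = 1;  // hypothetical kind tag

// Read one trivially copyable object straight from the stream.
template <typename T>
T read_pod(std::istream& in) {
    static_assert(std::is_trivially_copyable<T>::value, "raw reads need a trivially copyable type");
    T value{};
    in.read(reinterpret_cast<char*>(&value), sizeof(T));
    if (!in) throw std::runtime_error("unexpected end of input");
    return value;
}

// Read a header, then as many records as the header announces.
std::vector<Record> read_block(std::istream& in) {
    const Header header = read_pod<Header>(in);
    if (header.record_kind != kPlainRecord)
        throw std::runtime_error("unknown record kind");
    std::vector<Record> records(header.record_count);
    if (!records.empty()) {
        in.read(reinterpret_cast<char*>(records.data()),
                static_cast<std::streamsize>(records.size() * sizeof(Record)));
        if (!in) throw std::runtime_error("truncated record block");
    }
    return records;
}

// Usage example: round-trip a small in-memory block.
void demo() {
    std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
    const Header header{kPlainRecord, 2};
    const Record sample[2] = {{7, 1.5}, {8, 2.5}};
    ss.write(reinterpret_cast<const char*>(&header), sizeof header);
    ss.write(reinterpret_cast<const char*>(sample), sizeof sample);
    const std::vector<Record> records = read_block(ss);
    assert(records.size() == 2);
    assert(records[1].id == 8);
}
```

In the real files the record kind would select between several layouts rather than just one, but the shape of the problem is the same: nothing after the header can be read until the header has been decoded.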
My question is related to how to do this efficiently - I'm sure my process is going to be IO-limited, so my instinct is that rather than reading in data in smallish blocks, such as the following approach
std::vector<int> buffer(500);  // sized, not just reserved: reserve() creates no elements to read into
file.read(reinterpret_cast<char*>(buffer.data()), buffer.size() * sizeof(int));
I should read each file into memory in its entirety and process it there. So my interrelated questions are:
- Given that this would seem to mean reading into a char* buffer or a std::vector<char>, how would you best go about converting those raw bytes into the data structures needed to correctly represent the file structure?
- Are my assumptions incorrect?
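For reference, this is roughly what I have in mind for the read-everything-then-parse route. It is only a sketch: `read_file`, `ByteCursor` and `parse_values` are names I made up, and the toy format (a `uint32` count followed by that many doubles) stands in for my real one. Values are copied out with `std::memcpy` rather than by casting pointers into the buffer, to stay clear of alignment and aliasing problems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Slurp a whole file into memory with a single read call.
std::vector<char> read_file(const std::string& path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) throw std::runtime_error("cannot open " + path);
    const std::streamsize size = file.tellg();  // opened at the end, so this is the file size
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    file.seekg(0, std::ios::beg);
    std::vector<char> bytes(static_cast<std::size_t>(size));
    if (size > 0 && !file.read(bytes.data(), size))
        throw std::runtime_error("short read on " + path);
    return bytes;
}

// Walks over a byte buffer, copying out one typed value at a time.
class ByteCursor {
public:
    ByteCursor(const char* data, std::size_t size) : data_(data), size_(size), offset_(0) {}

    template <typename T>
    T take() {
        static_assert(std::is_trivially_copyable<T>::value, "take() needs a trivially copyable type");
        if (remaining() < sizeof(T)) throw std::out_of_range("buffer exhausted");
        T value;
        std::memcpy(&value, data_ + offset_, sizeof(T));
        offset_ += sizeof(T);
        return value;
    }

    std::size_t remaining() const {
        assert(offset_ <= size_);
        return size_ - offset_;
    }

private:
    const char* data_;
    std::size_t size_;
    std::size_t offset_;
};

// Toy format (hypothetical): a uint32 count, then that many doubles.
std::vector<double> parse_values(const std::vector<char>& bytes) {
    ByteCursor cursor(bytes.data(), bytes.size());
    const std::uint32_t count = cursor.take<std::uint32_t>();
    if (count > cursor.remaining() / sizeof(double))
        throw std::runtime_error("count exceeds the data available");
    std::vector<double> values;
    values.reserve(count);
    for (std::uint32_t i = 0; i < count; ++i)
        values.push_back(cursor.take<double>());
    return values;
}
```

Whether this actually beats many smaller `read` calls on a buffered stream is exactly what I'm unsure about.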
I know the obvious answer is to try and then to profile later, and profile I certainly will. But this question is more about how to pick the right approach at the beginning - a sort of "pick the right algorithm" optimisation, rather than the sort of optimisations that I could envisage doing after identifying bottlenecks later on!
I'll be interested in the answers offered up - I tend to only be able to find answers for relatively simple binary files, for which the approach above is suitable. My problem is that the bulk of the binary data is structured conditionally on the numbers in the header to the file (even the header is formatted this way!) so I need to be able to process the file a little more carefully.
Thanks in advance.
EDIT: Some comments coming through about memory mapping - looks good, but not sure how to do it and all I've read tells me it isn't portable. I'm interested in trying an mmap, but also in more portable solutions (if any!)
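From what I've pieced together so far, a read-only mapping on POSIX systems would look roughly like the sketch below. `MappedFile` is my own hypothetical wrapper, not a library class, and it relies on `open`, `fstat`, `mmap` and `munmap`, so it will not build on Windows as-is, which is the portability worry.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a whole file read-only and unmaps it on destruction (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open " + path);

        struct stat info;
        if (::fstat(fd, &info) != 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat " + path);
        }
        size_ = static_cast<std::size_t>(info.st_size);

        // mmap rejects zero-length mappings, so leave data_ null for empty files.
        if (size_ > 0) {
            void* mapping = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (mapping == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("cannot map " + path);
            }
            data_ = static_cast<const char*>(mapping);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_) ::munmap(const_cast<char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

    // Bounds-checked in debug builds only.
    char at(std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

The `data()`/`size()` pair could then be parsed in place just like an in-memory buffer, without copying the file into a vector first.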