How to efficiently read binary data from files that have a complicated structure?

Posted 2019-05-15 06:12

Question:

I am writing a piece of code to read in several GB of data that spans multiple files, using C++ IOStreams, which I've chosen over the C API for a number of design reasons I won't bore you with. Since the data is produced by a separate program on the same machine where my code will run, I am confident that issues such as endianness can, for the most part, be ignored.

The files have a reasonably complicated structure. For example, there is a header that describes the number of records of a particular binary layout, and later in the file the code must conditionally read exactly that many records. This sort of pattern is repeated in a complicated, but well-documented, way.

My question is about how to do this efficiently. I'm sure my process is going to be I/O-bound, so my instinct is that rather than reading in data in smallish blocks, as in the following approach,

// note: construct with a size (or resize()), not reserve() -- reserve() leaves
// size() == 0, so writing through the buffer would be undefined behaviour
std::vector<int> buffer(500);
file.read(reinterpret_cast<char*>(buffer.data()), 500 * sizeof(int));

I should read in each file in its entirety and process it in memory. So, my interrelated questions:

  • Given that this would seem to mean reading the data into a char* buffer or a std::vector, how would you best go about converting that raw array into the data structures that correctly represent the file's contents?
  • Are my assumptions incorrect?

I know the obvious answer is to try it and profile later, and profile I certainly will. But this question is more about picking the right approach at the beginning - a sort of "pick the right algorithm" optimisation, rather than the kind of optimisation I could envisage doing after identifying bottlenecks later on!

I'll be interested in the answers offered up - I tend to only find answers for relatively simple binary files, for which the approach above is suitable. My problem is that the bulk of the binary data is structured conditionally on the numbers in the file's header (even the header itself is formatted this way!), so I need to process the file rather more carefully.

Thanks in advance.

EDIT: Some comments coming through about memory mapping - it looks good, but I'm not sure how to do it, and everything I've read tells me it isn't portable. I'm interested in trying mmap, but also in more portable solutions (if any!)

Answer 1:

Use a 64-bit OS and memory map the file. If you need to support a 32-bit OS as well, use a compatibility layer that maps chunks of the file as needed.

Alternatively, if you always need the objects in file order, just write a sane parser that handles the objects in chunks, like this (a sketch follows the steps):

1) Read in 512KB of the file.

2) Extract as many complete objects as possible from the data read so far.

3) Read in as many bytes as needed to fill the buffer back up to 512KB. If no bytes at all could be read, stop.

4) Go to step 2.
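
A minimal sketch of that loop, assuming a hypothetical parse_one() that decodes a single object from a byte range and reports how many bytes it consumed (here it pretends every object is 16 bytes):

#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

// Hypothetical stand-in for the real parser: tries to decode one object from
// [data, data+len); returns true and sets 'consumed' on success, or false if
// more bytes are needed.
bool parse_one(const char* data, std::size_t len, std::size_t& consumed)
{
    constexpr std::size_t kRecordSize = 16;
    if (len < kRecordSize) return false;     // incomplete object: wait for more data
    // ... decode the object from 'data' here ...
    consumed = kRecordSize;
    return true;
}

void parse_file(std::ifstream& file)
{
    constexpr std::size_t kBufSize = 512 * 1024;
    std::vector<char> buf(kBufSize);
    std::size_t filled = 0;                  // bytes currently held in 'buf'

    for (;;) {
        // Steps 1 and 3: top the buffer back up to 512KB.
        file.read(buf.data() + filled,
                  static_cast<std::streamsize>(kBufSize - filled));
        const std::size_t got = static_cast<std::size_t>(file.gcount());
        if (got == 0 && filled == 0) break;  // clean end of file
        filled += got;

        // Step 2: extract as many complete objects as possible.
        std::size_t pos = 0, consumed = 0;
        while (pos < filled && parse_one(buf.data() + pos, filled - pos, consumed))
            pos += consumed;

        // Keep the unparsed tail; step 4 loops back to refill.
        std::memmove(buf.data(), buf.data() + pos, filled - pos);
        filled -= pos;
        if (got == 0) break;                 // EOF with a truncated trailing object
    }
}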



Answer 2:

You could mmap some segments of the file (or the entire file, at least on a 64-bit machine). It may also help to call madvise, and perhaps (in a separate thread) the Linux-specific readahead.
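
A minimal POSIX-only sketch (the function name map_whole_file is mine; Windows would need CreateFileMapping/MapViewOfFile instead):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only and hint sequential access to the kernel.
// Returns nullptr on failure; 'size_out' receives the file size.
const char* map_whole_file(const char* path, std::size_t& size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return nullptr;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return nullptr; }
    size_out = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               // the mapping keeps the file open
    if (p == MAP_FAILED) return nullptr;

    // Hint front-to-back access so the kernel prefetches aggressively.
    madvise(p, size_out, MADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}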



Answer 3:

I guess you already have enough to start with. Memory mapping is certainly a neat idea as long as you have enough RAM; otherwise, read in big chunks.

Once the data is available in memory (the whole file or a big chunk of it), the simplest way to read it is to:

  • define an appropriate struct
  • create a pointer to the appropriate offset in the memory where the data is loaded
  • reinterpret_cast that pointer to a pointer to the struct (or to an array of such structs)

You can use #pragmas to control the struct's packing and alignment if needed, but this is again OS/compiler-dependent.
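
A sketch of that, using a made-up Record layout - the real fields would come from the file's documentation:

#include <cstddef>
#include <cstdint>

// Made-up on-disk record; adjust the fields to match the real format.
#pragma pack(push, 1)                        // no padding: the struct mirrors the file bytes
struct Record {
    std::uint32_t id;
    std::uint16_t flags;
    double        value;
};
#pragma pack(pop)

// 'data' points at the loaded or mapped file image; records start at 'offset'.
const Record* view_records(const char* data, std::size_t offset)
{
    // Formally this cast is type punning, which the standard frowns on;
    // memcpy-ing each Record out is the strictly conforming alternative.
    return reinterpret_cast<const Record*>(data + offset);
}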



Answer 4:

Well, OK, the header is of variable length, but you have to start somewhere. If you have to read in the whole file first, it can get a bit messy. The whole file can be represented as a struct containing the header fields up to some length descriptor, followed by a byte array - you can start there. Once you have the header length, you can set a pointer/length pair for an array of header entries, iterate over those, and from them set a pointer/length pair for each array of file-content structs, and so on and so on.

All the various arrays of structs would probably need to be packed?
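
Something along these lines, with an invented header layout standing in for the real one (packed, as suggested above):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Invented layout: a fixed prefix ending in an entry count, then that many
// entries, then the file content the entries describe.
#pragma pack(push, 1)
struct HeaderPrefix { std::uint32_t magic; std::uint32_t entry_count; };
struct HeaderEntry  { std::uint32_t type;  std::uint32_t record_count; };
#pragma pack(pop)

// Walk the variable-length header in 'base'; on return, 'entries'/'count'
// describe the header-entry array, and the returned pointer marks where
// the file content begins.
const char* walk_header(const char* base, const HeaderEntry*& entries,
                        std::uint32_t& count)
{
    HeaderPrefix prefix;
    std::memcpy(&prefix, base, sizeof prefix);   // memcpy sidesteps alignment issues
    count = prefix.entry_count;

    entries = reinterpret_cast<const HeaderEntry*>(base + sizeof prefix);
    return base + sizeof prefix + count * sizeof(HeaderEntry);
}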

Nasty. I don't really like my own design :(

Anyone got a better idea, other than rewriting the 'separate program' to use a database or XML or something?