I have a large set of data chunks (~50GB). In my code I have to be able to do the following things:
1. Repeatedly iterate over all chunks and do some computations on them.
2. Repeatedly iterate over all chunks and do some computations on them, where in each iteration the order of visited chunks is (as far as possible) randomized.
So far, I have split the data into 10 binary files (created with boost::serialization) and repeatedly read them one after the other, performing the computations. For (2), I read the 10 files in random order and process each one in sequence, which is good enough.
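To make the current setup concrete, here is roughly what the loading loop looks like (just a sketch: the Chunk type, its members, and the file names "chunks_0.bin" ... "chunks_9.bin" are placeholders for my actual code):

    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/serialization/vector.hpp>
    #include <algorithm>
    #include <fstream>
    #include <random>
    #include <string>
    #include <vector>

    // Placeholder for my real chunk type.
    struct Chunk {
        std::vector<double> values;
        template <class Archive>
        void serialize(Archive& ar, unsigned /*version*/) { ar & values; }
    };

    void process_all_files_randomized() {
        std::vector<int> order = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        std::shuffle(order.begin(), order.end(),
                     std::mt19937{std::random_device{}()});

        for (int i : order) {
            std::ifstream in("chunks_" + std::to_string(i) + ".bin",
                             std::ios::binary);
            boost::archive::binary_iarchive ar(in);
            std::vector<Chunk> chunks;
            ar >> chunks;  // this deserialization step is the slow part
            for (const Chunk& c : chunks) {
                // ... do computations on c ...
            }
        }
    }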
However, reading one of the files (using boost::serialization) takes a long time, and I'd like to speed it up.
Can I use memory-mapped files instead of boost::serialization?
In particular, I'd have a vector<Chunk*> in each file. I want to be able to read in such a file very, very quickly.
How can I read/write such a vector<Chunk*> data structure? I have looked at boost::interprocess::file_mapping, but I'm not sure how to do it.
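As far as I understand, the basic use of file_mapping only maps the raw bytes of a file into memory, roughly like this (a sketch; the file name is a placeholder), which gives me a byte pointer but not a vector<Chunk*>:

    #include <boost/interprocess/file_mapping.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstddef>

    namespace bip = boost::interprocess;

    int main() {
        // Map an existing data file read-only; "chunks_0.bin" is a placeholder name.
        bip::file_mapping  mapping("chunks_0.bin", bip::read_only);
        bip::mapped_region region(mapping, bip::read_only);

        const char* bytes = static_cast<const char*>(region.get_address());
        std::size_t size  = region.get_size();

        // 'bytes' now points at the file contents, but it is just raw memory;
        // the question is how to get a vector<Chunk*>-like structure out of it.
        (void)bytes; (void)size;
    }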
I read this (http://boost.cowic.de/rc/pdf/interprocess.pdf), but it doesn't say much about memory-mapped files. I think I'd store the vector<Chunk*> first in the mapped memory, then store the Chunks themselves. And vector<Chunk*> would actually become offset_ptr<Chunk>*, i.e., an array of offset_ptr?
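To make that idea concrete, this is roughly what I imagine using managed_mapped_file with an interprocess allocator, so that the container stores offset_ptr internally (just a sketch: the Chunk layout, the file name, and the segment size are made up, and Chunk would have to contain no raw pointers since the mapping address can change between runs):

    #include <boost/interprocess/managed_mapped_file.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>

    namespace bip = boost::interprocess;

    // Placeholder chunk type: plain data, no raw pointers.
    struct Chunk {
        double values[256];
    };

    using ChunkAlloc  = bip::allocator<Chunk, bip::managed_mapped_file::segment_manager>;
    using ChunkVector = bip::vector<Chunk, ChunkAlloc>;  // uses offset_ptr internally

    void write_file() {
        // Create a mapped file large enough for all chunks (size is a guess here).
        bip::managed_mapped_file file(bip::create_only, "chunks_0.mm", 6ull << 30);
        ChunkVector* chunks =
            file.construct<ChunkVector>("chunks")(ChunkAlloc(file.get_segment_manager()));
        chunks->resize(1000);  // fill with real data instead
    }

    void read_file() {
        // Re-mapping should be cheap: the OS pages data in lazily as it is touched.
        bip::managed_mapped_file file(bip::open_read_only, "chunks_0.mm");
        const ChunkVector* chunks = file.find<ChunkVector>("chunks").first;
        for (const Chunk& c : *chunks) {
            // ... do computations on c ...
        }
    }

Is something along these lines the right way to go, or am I misunderstanding how offset_ptr and the mapped segment are supposed to be used?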