C++ random access iterators for containers with el

2019-07-04 00:08发布

问题:

I'm currently working on a small project which requires loading messages from a file. The messages are stored sequentially in the file and files can become huge, so loading the entire file content into memory is unrewarding.

Therefore we decided to implement a FileReader class that is capable of moving to specific elements in the file quickly and load them on request. Commonly used something along the following lines

SpecificMessage m;
FileReader fr;
fr.open("file.bin");
fr.moveTo(120); // Move to Message #120
fr.read(&m);    // Try deserializing as SpecificMessage 

The FileReader per se works great. Therefore we thought about adding STL compliant iterator support as well: A random access iterator that provides read-only references to specific messages. Used in the following way

for (auto iter = fr.begin<SpecificMessage>(); iter != fr.end<SpecificMessage>(); ++iter) {
  // ...
}

Remark: the above assumes that the file only contains messages of type SpecificMessage. We've been using boost::iterator_facade to simplify the implementation.

Now my question boils down to: how to implement the iterator correctly? Since FileReader does not actually hold a sequence of messages internally, but loads them on request.

What we've tried so far:

Storing the message as an iterator member

This approach stores the message in the iterator instance. Which works great for simple use-cases but fails for more complex uses. E.g. std::reverse_iterator has a dereference operation that looks like this

 reference operator*() const
 {  // return designated value
   _RanIt _Tmp = current;
   return (*--_Tmp);
 }

This breaks our approach as a reference to a message from a temporary iterator is returned.

Making the reference type equal the value type

@DDrmmr in the comments suggested making the reference type equal the value type, so that a copy of the internally stored object is returned. However, I think this is not valid for the reverse iterator which implements the -> operator as

pointer operator->() const {
  return (&**this);
}

which derefs itself, calls the *operator which then returns a copy of a temporary and finally returns the address of this temporary.

Storing the message externally

Alternatively I though about storing the message externally:

SpecificMessage m;
auto iter = fr.begin<SpecificMessage>(&m);
// ...

which also seems to be flawed for

auto iter2 = iter + 2

which will have both iter2 and iter point to the same content.

回答1:

As I hinted in my other answer, you could consider using memory mapped files. In the comment you asked:

As far as memory mapped files is concerned, this seems not what I want to have, as how would you provide an iterator over SpecificMessages for them?

Well, if your SpecificMessage is a POD type, you could just iterate over the raw memory directly. If not, you could have a deserialization helper (as you already have) and use Boost transform_iterator to do the deserialization on demand.

Note that we can make the memory mapped file managed, effectively meaning that you can just use it as a regular heap, and you can store all standard containers. This includes node-based containers (map<>, e.g.), dynamic-size containers (e.g. vector<>) in addition to the fixed-size containers (array<>) - and any combinations of those.

Here's a demo that takes a simple SpecificMessage that contains a string, and (de)derializes it directly into shared memory:

using blob_t       = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;

The part that interests you would be the consuming part:

bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;

// for fun, let's reverse the blobs
for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
    std::cout << "blob: '" << first->contents << "'\n";

// any kind of random access is okay, though:
auto random = rand() % table->size();
SpecificMessage msg;
load(table->at(random), msg);
std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";

So this prints each 13th message, in reverse order, followed by a random blob.

Full Demo

The sample online uses the lines of the sources as "messages".

Live On Coliru

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/container/scoped_allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <iostream>

#include <boost/iterator/transform_iterator.hpp>
#include <boost/range/iterator_range.hpp>

static char const* DBASE_FNAME = "database.map";

namespace bip = boost::interprocess;

namespace shm {
    using segment_manager = bip::managed_mapped_file::segment_manager;
    template <typename T> using allocator = boost::container::scoped_allocator_adaptor<bip::allocator<T, segment_manager> >;
    template <typename T> using vector    = bip::vector<T, allocator<T> >;
}

using blob_t       = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;

struct SpecificMessage {
    // for demonstration purposes, just a string; could be anything serialized
    std::string contents;

    // trivial save/load serialization code:
    template <typename Blob>
    friend bool save(Blob& blob, SpecificMessage const& msg) {
        blob.assign(msg.contents.begin(), msg.contents.end());
        return true;
    }

    template <typename Blob>
    friend bool load(Blob const& blob, SpecificMessage& msg) {
        msg.contents.assign(blob.begin(), blob.end());
        return true;
    }
};

template <typename Message> struct LazyLoader {
    using type = Message;

    Message operator()(blob_t const& blob) const {
        Message result;
        if (!load(blob, result)) throw std::bad_cast(); // TODO custom excepion
        return result;
    }
};

///////
// for demo, create some database contents
void create_database_file() {
    bip::file_mapping::remove(DBASE_FNAME);
    bip::managed_mapped_file mmf(bip::open_or_create, DBASE_FNAME, 1ul<<20); // Even sparse file size is limited on Coliru

    shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

    std::ifstream ifs("main.cpp");
    std::string line;
    while (std::getline(ifs, line)) {
        table->emplace_back();
        save(table->back(), SpecificMessage { line });
    }

    std::cout << "Created blob table consisting of " << table->size() << " blobs\n";
}

///////

void display_random_messages() {
    bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
    shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

    using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;

    // for fun, let's reverse the blobs
    for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
        std::cout << "blob: '" << first->contents << "'\n";

    // any kind of random access is okay, though:
    auto random = rand() % table->size();
    SpecificMessage msg;
    load(table->at(random), msg);
    std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";
}

int main()
{
#ifndef CONSUMER_ONLY
    create_database_file();
#endif

    srand(time(NULL));
    display_random_messages();
}


回答2:

You are having issues because your iterator does not conform to the forward iterator requirements. Specifically:

  • *i must be an lvalue reference to value_type or const value_type ([forward.iterators]/1.3)
  • *i cannot be a reference to an object stored in the iterator itself, due to the requirement that two iterators are equal if and only if they are bound to the same object ([forward.iterators]/6)

Yes, these requirements are a huge pain in the butt, and yes, that means that things like std::vector<bool>::iterator are not random access iterators even though some standard library implementations incorrectly claim that they are.


EDIT: The following suggested solution is horribly broken, in that dereferencing a temporary iterator returns a reference to an object that may not live until the reference is used. For example, after auto& foo = *(i + 1); the object referenced by foo may have been released. The implementation of reverse_iterator referenced in the OP will cause the same problem.

I'd suggest that you split your design into two classes: FileCache that holds the file resources and a cache of loaded messages, and FileCache::iterator that holds a message number and lazily retrieves it from the FileCache when dereferenced. The implementation could be something as simple as storing a container of weak_ptr<Message> in FileCache and a shared_ptr<Message> in the iterator: Simple demo



回答3:

I have to admit I may not fully understand the trouble you have with holding the current MESSAGE as a member of Iter. I would associate each iterator with the FileReader it should read from and implement it as a lightweight encapsulation of a read index for FileReader::(read|moveTo). The most important method to overwtite is boost::iterator_facade<...>::advance(...) which modifies the current index and tries to pull a new MESSAGE from the FileReader If this fails it flags the the iterator as invalid and dereferencing will fail.

template<class MESSAGE,int STEP>           
class message_iterator; 

template<class MESSAGE> 
class FileReader { 
public: 
    typedef message_iterator<MESSAGE, 1> const_iterator; 
    typedef message_iterator<MESSAGE,-1> const_reverse_iterator; 

    FileReader(); 
    bool open(const std::string  & rName); 
    bool moveTo(int n); 
    bool read(MESSAGE &m); 

    // get the total count of messages in the file 
    // helps us to find end() and rbegin() 
    int getMessageCount(); 

    const_iterator begin() {                                           
        return const_iterator(this,0); 
    } 
    const_iterator end() { 
        return const_iterator(this,getMessageCount()); 
    } 
    const_reverse_iterator rbegin() { 
        return const_reverse_iterator(this,getMessageCount()-1); 
    } 
    const_reverse_iterator rend() { 
        return const_reverse_iterator(this,-1); 
    } 
}; 

// declaration of message_iterator moving over MESSAGE 
// STEP is used to specify STEP size and direction (e.g -1 == reverse) 
template<class MESSAGE,int STEP=1>                                                 
class message_iterator 
    : public boost::iterator_facade< 
    message_iterator<MESSAGE> 
    , const MESSAGE  
    , boost::random_access_traversal_tag 
    > 
{ 
    typedef  boost::iterator_facade< 
        message_iterator<MESSAGE> 
        , const MESSAGE 
        , boost::random_access_traversal_tag 
        > super; 

public:                                                               
    // constructor associates an iterator with its FileReader and a given position 
    explicit message_iterator(FileReader<MESSAGE> * p=NULL,int n=0): _filereader(p),_idx(n),_valid(false)    { 
        advance(0); 
    } 
    bool equal(const message_iterator & i) const { 
        return i._filereader == _filereader && i._idx == _idx; 
    } 
    void increment() { 
        advance(+1); 
    } 
    void decrement() { 
        advance(-1); 
    } 

    // overwrite with central functionality. Move to a given relative 
    // postion and check wether the position can be read. If move/read 
    // fails we flag the iterator as incalid. 

    void advance(int n) { 
        _idx += n*STEP; 
        if(_filereader!=NULL) { 
            if( _filereader->moveTo( _idx ) && _filereader->read(_m)) { 
                _valid = true; 
                return; 
            } 
        } 
        _valid = false; 
    } 
    // Return a ref to the currently cached MESSAGE. Throw 
    // an acception if positioning at this location in advance(...) failes. 
    typename super::reference dereference() const { 
        if(!_valid) { 
            throw std::runtime_error("access to invalid pos"); 
        } 
        return _m; 
    } 

private: 
    FileReader<MESSAGE> * _filereader; 
    int                   _idx; 
    bool                  _valid; 
    MESSAGE               _m; 
}; 


回答4:

Boost PropertyMap

You could avoid writing the bulk of the code using Boost PropertyMap:

Live On Coliru

#include <boost/property_map/property_map.hpp>
#include <boost/property_map/function_property_map.hpp>

using namespace boost;

struct SpecificMessage {
    // add some data
    int index; // just for demo
};

template <typename Message>
struct MyLazyReader {
    typedef Message type;
    std::string fname;

    MyLazyReader(std::string fname) : fname(fname) {}

    Message operator()(size_t index) const { 
        Message m;
        // FileReader fr;
        // fr.open(fname);
        // fr.moveTo(index);     // Move to Message 
        // fr.read(&m);          // Try deserializing as SpecificMessage  
        m.index = index; // just for demo
        return m;
    }
};

#include <iostream>

int main() {

    auto lazy_access = make_function_property_map<size_t>(MyLazyReader<SpecificMessage>("file.bin"));

    for (int i=0; i<10; ++i)
        std::cout << lazy_access[rand()%256].index << "\n";
}

Sample output is

103
198
105
115
81
255
74
236
41
205

Using Memory Mapped Files

You could store a map of index -> BLOB objects in a shared vector<array<byte, N>>, flat_map<size_t, std::vector<uint8_t> > or similar.

So, now you only have to deserialize from myshared_map[index].data() (begin() and end() in case the BLOB size varies)