I have a rather large amount of data to analyse - each file is about 5gigs. Each file is of the following format:
xxxxx yyyyy
Both key and value can repeat, but the keys are sorted in increasing order. I'm trying to use a memory mapped file for this purpose and then find the required keys and work with them. This is what I've written:
if (data_file != "")
{
clock_start = clock();
data_file_mapped.open(data_file);
data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data();
if (data_multimap != NULL)
{
std::multimap<double, unsigned int>::iterator it = data_multimap->find(keys_to_find[4]);
if (it != data_multimap->end())
{
std::cout << "Element found.";
for (std::multimap<double, unsigned int>::iterator it = data_multimap->lower_bound(keys_to_find[4]); it != data_multimap->upper_bound(keys_to_find[5]); ++it)
{
std::cout << it->second;
}
std::cout << "\n";
clock_end = clock();
std::cout << "Time taken to read in the file: " << (clock_end - clock_start)/CLOCKS_PER_SEC << "\n";
}
else
std::cerr << "Element not found at all" << "\n";
}
else
std::cerr << "Nope - no data received."<< "\n";
}
Basically, I need to locate ranges of keys and pull those chunks out to work on. I get a segfault the first time I try to use a method on the multimap. For example, when the find
method is called. I tried the upper_bound
, lower_bound
and other methods too, and still get a segfault.
This is what gdb
gives me:
Program received signal SIGSEGV, Segmentation fault.
_M_lower_bound (this=<optimized out>, __k=<optimized out>, __y=<optimized out>, __x=0xa31202030303833) at /usr/include/c++/4.9.2/bits/stl_tree.h:1261
1261 if (!_M_impl._M_key_compare(_S_key(__x), __k))
Could someone please point out what I'm doing wrong? I've only been able to find simplistic examples on memory mapped files - nothing like this yet. Thanks.
EDIT: More information:
The file I described above is basically a two column plain text file which a neural simulator gives me as the output of my simulations. It's simple like this:
$ du -hsc 201501271755.e.ras
4.9G 201501271755.e.ras
4.9G total
$ head 201501271755.e.ras
0.013800 0
0.013800 1
0.013800 10
0.013800 11
0.013800 12
0.013800 13
0.013800 14
0.013800 15
0.013800 16
0.013800 17
The first column is time, the second column is the neurons that fired at this time - (it's a spike time raster file). Actually, the output is a file like this from each MPI rank that is being used to run the simulation. The various files have been combined to this master file using sort -g -m
. More information on the file format is here: http://www.fzenke.net/auryn/doku.php?id=manual:ras
To calculate the firing rate and other metrics of the neuron set at certain times of the simulation, I need to - locate the time in the file, pull out a chunk between [time -1,time] and run some metrics and so on on this chunk. This file is quite small and I expect the size to increase quite a bit as my simulations get more and more complicated and run for longer time periods. It's why I began looking into memory mapped files. I hope that clarifies the problem statement somewhat. I only need to read the output file to process the information it contains. I do not need to modify this file at all.
To process the data, I'll use multiple threads again, but since all my operations on the map are read-only, I don't expect to run into trouble there.
You're trying things you don't really understand :) No problem.
Multi maps aren't laid out sequentially in memory. (They're node-based containers, but I digress). In fact, even if they were, chances would be slim that the layout would match that of the text input.
There's basically two ways you can make this work:
Keep using the multimap
but use a custom allocator (so that all allocations are done in the mapped memory region). This is the "nicest" from a high-level C++ viewpoint, /but/ you will need to change to a binary format of your file.
If you can, this is what I'd suggest. Boost Container + Boost Interprocess have everything you need to make this relatively painless.
You write a custom container "abstraction" that works directly on the mapped data. You could either
- recognize a "xxxx yyyy" pair from anywhere (line ends?) or
- build an index of all line starts in the file.
Using these you can devise an interator (Boost Iterator iterator_facade
) that you can use to implement higher level operations (lower_bound
, upper_bound
and equal_range
).
Once you have these, you're basically all set to query this memory map as a readonly key-value database.
Sadly, this kind of memory representation would be extremely bad for performance if you also want to support mutating operations (insert
, remove
).
If you have an actual sample of the file, I could do a demonstration of either of the approaches described.
Update
Quick Samples:
With boost::interprocess you can (very) simply define the multimap you desire:
namespace shared {
namespace bc = boost::container;
template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
template <typename K, typename V>
using multi_map = bc::flat_multimap<
K, V, std::less<K>,
allocator<typename bc::flat_multimap<K, V>::value_type> >;
}
Notes:
I chose flatmap
(flat_multimap
, actually) because it is likely more
storage efficient, and is much more comparable to the second approach
(given below);
Note that this choice affects iterator/reference stability and will
favours read-only operations pretty heavily. If you need iterator
stability and/or many mutating operations, use a regular map
(or for
very high volumes a hash_map
) instead of the flat variations.
I chose a managed_mapped_file
segment for this demonstration (so you get persistence). The demo shows how 10G is sparsely pre-allocated, but only the space actually allocated is used on disk. You could equally well use a managed_shared_memory
.
If you have binary persistence, you might discard the text datafile altogether.
I parse the text data into a shared::multi_map<double, unsigned>
from a mapped_file_source
using Boost Spirit. The implementation is fully generic.
There is no need to write iterator
classes, start_of_line()
, end_of_line()
, lower_bound()
, upper_bound()
, equal_range()
or any of those, since they're already standard in the multi_map
interface, so all we need to is write main
:
Live On Coliru
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/container/flat_map.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iomanip>
namespace bip = boost::interprocess;
namespace qi = boost::spirit::qi;
namespace shared {
namespace bc = boost::container;
template <typename T> using allocator = bip::allocator<T, bip::managed_mapped_file::segment_manager>;
template <typename K, typename V>
using multi_map = bc::flat_multimap<
K, V, std::less<K>,
allocator<typename bc::flat_multimap<K, V>::value_type> >;
}
#include <iostream>
bip::managed_mapped_file msm(bip::open_or_create, "lookup.bin", 10ul<<30);
template <typename K, typename V>
shared::multi_map<K,V>& get_or_load(const char* fname) {
using Map = shared::multi_map<K, V>;
Map* lookup = msm.find_or_construct<Map>("lookup")(msm.get_segment_manager());
if (lookup->empty()) {
// only read input file if not already loaded
boost::iostreams::mapped_file_source input(fname);
auto f(input.data()), l(f + input.size());
bool ok = qi::phrase_parse(f, l,
(qi::auto_ >> qi::auto_) % qi::eol >> *qi::eol,
qi::blank, *lookup);
if (!ok || (f!=l))
throw std::runtime_error("Error during parsing at position #" + std::to_string(f - input.data()));
}
return *lookup;
}
int main() {
// parse text file into shared memory binary representation
auto const& lookup = get_or_load<double, unsigned int>("input.txt");
auto const e = lookup.end();
for(auto&& line : lookup)
{
std::cout << line.first << "\t" << line.second << "\n";
auto er = lookup.equal_range(line.first);
if (er.first != e) std::cout << " lower: " << er.first->first << "\t" << er.first->second << "\n";
if (er.second != e) std::cout << " upper: " << er.second->first << "\t" << er.second->second << "\n";
}
}
I implemented it exactly as I described:
- simple container over the raw
const char*
region mapped;
- using
boost::iterator_facade
to make an iterator that parses the text on dereference;
- for printing the input lines I use
boost::string_ref
- which avoids dynamic allocations for copying strings.
parsing is done with Spirit Qi:
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))
Qi was chosen for speed and genericity: you can choose the Key
and Value
types at instantiation time:
text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
I've implemented lower_bound
, upper_bound
and equal_range
member functions that take advantage of underlying contiguous storage. Even though the "line" iterator
is not random-access but bidirectional, we can still jump to the mid_point
of such an iterator range because we can get the start_of_line
from any const char*
into the underlying mapped region. This make binary searching efficient.
Note that this solution parses lines on dereference of the iterator
. This might not be efficient if the same lines are dereferenced a lot of times.
But, for infrequent lookups, or lookups that are not typical in the same region of the input data, this is about as efficient as it can possibly get (doing only minimum required parsing and O(log n)
binary searching), all the while completely bypassing the initial load time by mapping the file instead (no access means nothing needs to be loaded).
Live On Coliru (including test data)
#define NDEBUG
#undef DEBUG
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/optional.hpp>
#include <boost/spirit/include/qi.hpp>
#include <thread>
#include <iomanip>
namespace io = boost::iostreams;
namespace qi = boost::spirit::qi;
template <typename Key, typename Value>
struct text_multi_lookup {
text_multi_lookup(char const* begin, char const* end)
: _map_begin(begin),
_map_end(end)
{
}
private:
friend struct iterator;
enum : char { nl = '\n' };
using rawit = char const*;
rawit _map_begin, _map_end;
rawit start_of_line(rawit it) const {
while (it > _map_begin) if (*--it == nl) return it+1;
assert(it == _map_begin);
return it;
}
rawit end_of_line(rawit it) const {
while (it < _map_end) if (*it++ == nl) return it;
assert(it == _map_end);
return it;
}
public:
struct value_type final {
rawit beg, end;
Key key;
Value value;
boost::string_ref str() const { return { beg, size_t(end-beg) }; }
};
struct iterator : boost::iterator_facade<iterator, boost::string_ref, boost::bidirectional_traversal_tag, value_type> {
iterator(text_multi_lookup const& d, rawit it) : _region(&d), _data { it, nullptr, Key{}, Value{} } {
assert(_data.beg == _region->start_of_line(_data.beg));
}
private:
friend text_multi_lookup;
text_multi_lookup const* _region;
value_type mutable _data;
void ensure_parsed() const {
if (!_data.end)
{
assert(_data.beg == _region->start_of_line(_data.beg));
auto b = _data.beg;
_data.end = _region->end_of_line(_data.beg);
if (!qi::phrase_parse(
b, _data.end,
qi::auto_ >> qi::auto_ >> qi::eoi,
qi::space,
_data.key, _data.value))
{
std::cerr << "Problem in: " << std::string(_data.beg, _data.end)
<< "at: " << std::setw(_data.end-_data.beg) << std::right << std::string(_data.beg,_data.end);
assert(false);
}
}
}
static iterator mid_point(iterator const& a, iterator const& b) {
assert(a._region == b._region);
return { *a._region, a._region->start_of_line(a._data.beg + (b._data.beg -a._data.beg)/2) };
}
public:
value_type const& dereference() const {
ensure_parsed();
return _data;
}
bool equal(iterator const& o) const {
return (_region == o._region) && (_data.beg == o._data.beg);
}
void increment() {
_data = { _region->end_of_line(_data.beg), nullptr, Key{}, Value{} };
assert(_data.beg == _region->start_of_line(_data.beg));
}
};
using const_iterator = iterator;
const_iterator begin() const { return { *this, _map_begin }; }
const_iterator end() const { return { *this, _map_end }; }
const_iterator cbegin() const { return { *this, _map_begin }; }
const_iterator cend() const { return { *this, _map_end }; }
template <typename CompatibleKey>
const_iterator lower_bound(CompatibleKey const& key) const {
auto f(begin()), l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (m->key < key) {
f = m;
++f;
}
else {
l = m;
}
}
return f;
}
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key) const {
return upper_bound(key, begin());
}
private:
template <typename CompatibleKey>
const_iterator upper_bound(CompatibleKey const& key, const_iterator f) const {
auto l(end());
while (f!=l) {
auto m = iterator::mid_point(f,l);
if (key < m->key) {
l = m;
}
else {
f = m;
++f;
}
}
return f;
}
public:
template <typename CompatibleKey>
std::pair<const_iterator, const_iterator> equal_range(CompatibleKey const& key) const {
auto lb = lower_bound(key);
return { lb, upper_bound(key, lb) };
}
};
#include <iostream>
int main() {
io::mapped_file_source map("input.txt");
text_multi_lookup<double, unsigned int> tml(map.data(), map.data() + map.size());
auto const e = tml.end();
for(auto&& line : tml)
{
std::cout << line.str();
auto er = tml.equal_range(line.key);
if (er.first != e) std::cout << " lower: " << er.first->str();
if (er.second != e) std::cout << " upper: " << er.second->str();
}
}
For the curious: here's the disassembly. Note how all the algorithmic stuff is inlined right into main
: http://paste.ubuntu.com/9946135/
data_multimap = (std::multimap<double, unsigned int> *)data_file_mapped.data();
, as far I can read from the boost documentation, you have missunderstood that function, that casting will not work, you need to fill the the multimap with the char* provided by data()
I edit to add a bit more detailed content, for example after the mapping, you can do
std::getline(data_file_mapped, oneString);
And after that, deliver the content on the line (you can use a stringstream for that task) and fill your multimap.
Repeat the process until the end of the file.