Simplest way to read a CSV file mapped to memory?

Posted 2020-02-06 10:09

When I read from files in C++11, I map them into memory using:

namespace bip = boost::interprocess; // from <boost/interprocess/file_mapping.hpp> and <boost/interprocess/mapped_region.hpp>

bip::file_mapping* fm = new bip::file_mapping(path, bip::read_only);
bip::mapped_region* region = new bip::mapped_region(*fm, bip::read_only);
char* bytes = static_cast<char*>(region->get_address());

This is fine when I want to read byte by byte, extremely fast. However, I have now created a CSV file which I would like to map into memory, read line by line, and split each line on the commas.

Is there a way I can do this with a few modifications of my above code?

(I am mapping into memory because I have an awful lot of memory and I do not want any bottleneck from disk I/O or streaming.)

2 Answers

Answer 1 (女痞, 2020-02-06 10:27):

Simply create an istringstream from your memory-mapped bytes and parse it with boost::tokenizer:

#include <sstream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Copy the mapped bytes into a string and wrap it in a stream.
const std::string stringBuffer(bytes, region->get_size());
std::istringstream is(stringBuffer);

typedef boost::tokenizer<boost::escaped_list_separator<char>> Tokenizer;
std::string line;
std::vector<std::string> parsed;
while (std::getline(is, line))
{
    Tokenizer tokenizer(line);
    parsed.assign(tokenizer.begin(), tokenizer.end());
    for (auto& column : parsed)
    {
        // use each CSV field here
    }
}
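
A side benefit of escaped_list_separator is that quoted fields containing embedded commas come through intact. A quick standalone sketch of that behaviour (the sample line is made up):

#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main()
{
    // The default escaped_list_separator uses '\' as escape, ',' as
    // separator and '"' as quote, so embedded commas are preserved.
    std::string line = "plain,\"quoted, with comma\",last";
    boost::tokenizer<boost::escaped_list_separator<char>> tok(line);
    for (auto const& field : tok)
        std::cout << "[" << field << "]\n"; // [plain] [quoted, with comma] [last]
}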

Note that on many systems memory mapping provides no speed benefit over a sequential read. In both cases you end up reading the data from disk page by page, probably with the same amount of read-ahead, so both the I/O latency and the bandwidth will be the same. Whether you have lots of memory or not makes no difference. Also, depending on the system, memory mapping, even read-only, can lead to surprising behaviours (e.g. reserving swap space) that sometimes keep people busy troubleshooting.
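
For reference, here is a minimal sketch of the plain sequential-read alternative that comparison alludes to (the file name csv.txt is illustrative, and this sketch does no quote handling):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("csv.txt"); // plain buffered reads, no mapping
    std::string line, field;
    while (std::getline(in, line))
    {
        std::istringstream ls(line);
        while (std::getline(ls, field, ','))
        {
            // use each field here
        }
    }
}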

Answer 2 (三岁会撩人, 2020-02-06 10:33):

Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5 million lines[1]) in ~1 second.

The result is then randomly accessible with zero copying, so there is no overhead (unless pages are swapped out).

For comparison:

  • it's ~3x faster than a naive wc csv.txt on the same file
  • it's about as fast as the following perl one-liner (which lists the distinct field counts across all lines):

    perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields  }' csv.txt

  • it's only about 1.5x slower than LANG=C wc csv.txt, which avoids locale functionality

Here's the parser in all its glory:

// assumes: namespace qi = boost::spirit::qi; plus <boost/spirit/include/qi.hpp>,
// <boost/spirit/include/phoenix.hpp> and <boost/utility/string_ref.hpp>
using CsvField = boost::string_ref;
using CsvLine  = std::vector<CsvField>;
using CsvFile  = std::vector<CsvLine>;  // keep it simple :)

struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using namespace qi;
        using boost::phoenix::construct; using boost::phoenix::begin; using boost::phoenix::size;

        field = raw [*~char_(",\r\n")]
            [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
        line  = field % ',';
        lines = line  % eol;
    }

  private:
    qi::rule<char const*, CsvField()> field;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvFile()>  lines;
};

The only tricky thing (and the only optimization there) is the semantic action, which constructs a CsvField from the source iterator and the matched number of characters.
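
The zero-copy property comes from boost::string_ref being nothing more than a (pointer, length) view into the mapped buffer. A tiny standalone illustration (the buffer contents are made up):

#include <boost/utility/string_ref.hpp>
#include <iostream>

int main()
{
    static const char buffer[] = "pinks,sackcloth's,axioms";
    // string_ref stores only a pointer and a length; the five
    // characters of "pinks" are never copied out of buffer.
    boost::string_ref field(buffer, 5);
    std::cout << field << "\n"; // prints: pinks
}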

Here's the main:

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main()
{
    // Map the whole file read-only; data() exposes it as a contiguous char range.
    boost::iostreams::mapped_file_source csv("csv.txt");

    CsvFile parsed;
    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size()
                  << " lines of CSV values\n";
    }
}

Printing

116 MiB parsed into 2578421 lines of CSV values

You can use the values just as you would a std::string:

for (int i = 0; i < 10; ++i)
{
    auto l     = rand() % parsed.size();
    auto& line = parsed[l];
    auto c     = rand() % line.size();

    std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
}

Which prints, e.g.:

Random field at L:1979500    C:2    sateen's
Random field at L:928192     C:1    sackcloth's
Random field at L:1570275    C:4    accompanist's
Random field at L:479916     C:2    apparel's
Random field at L:767709     C:0    pinks
Random field at L:1174430    C:4    axioms
Random field at L:1209371    C:4    wants
Random field at L:2183367    C:1    Klondikes
Random field at L:2142220    C:1    Anthony
Random field at L:1680066    C:2    pines
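
One caveat: the string_ref values only point into the mapping, so they are valid only while csv stays alive. If a field has to outlive the mapping, copy it out, for example:

std::string owned = parsed[0][0].to_string(); // copies the characters into an owning std::string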

The fully working sample is here: Live On Coliru


[1] I created the file by repeatedly appending the output of

while read a && read b && read c && read d && read e
do echo "$a,$b,$c,$d,$e"
done < /etc/dictionaries-common/words

to csv.txt, until it counted 2.5 million lines.
