Multiple delimiters for getline function, c++ [dup

Posted 2020-03-24 07:15

Question:

I want to read a text word by word, skipping any non-alphanumeric characters, in a simple way. Having 'evolved' from text containing only white space and '\n', I now need to handle text that also contains ',' and '.', for example. The first case was solved simply by using getline with the delimiter ' '. I wondered whether there is a way to use getline with multiple delimiters, or even with some kind of regular expression (for example '.'|' '|','|'\n').

As far as I know, getline reads characters from the input stream until the delimiter character ('\n' by default) is reached. My first guess was that it would be quite simple to give it multiple delimiters, but I found out that it is not.

Edit: just as a clarification: a C-style solution (strtok, for example, which in my opinion is very ugly) or an algorithmic solution is not what I'm looking for. It is fairly easy to come up with a simple algorithm to solve this problem and implement it. I'm looking for a more elegant solution, or at least an explanation of why getline can't handle it, since, unless I have completely misunderstood it, it should be able to accept more than one delimiter somehow.

Answer 1:

There's good news and bad news. The good news is that you can do this.

The bad news is that doing it is fairly roundabout, and some people find it downright ugly and nasty.

To do it, you start by observing two facts:

  1. The normal string extractor uses whitespace to delimit "words".
  2. What constitutes white space is defined in the stream's locale.
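Both facts can be checked with a small sketch. The helper names extract_tokens and is_space_in are made up for this illustration, and the sketch assumes the stream uses the default "C" locale:

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect tokens with operator>>, which stops at
// whatever the stream's locale classifies as white space.
std::vector<std::string> extract_tokens(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string token; in >> token;)
        tokens.push_back(token);
    return tokens;
}

// Hypothetical helper: ask a locale's ctype facet whether ch is white space.
bool is_space_in(const std::locale& loc, char ch) {
    return std::isspace(ch, loc);
}
```

With the classic locale, ' ' and '\n' count as white space but ',' and '.' do not, so extract_tokens leaves the punctuation attached to the words.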

Putting those together, the answer becomes fairly obvious (if circuitous): to define multiple delimiters, we define a locale that allows us to specify what characters should be treated as delimiters (i.e., white space):

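#include <iostream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// ctype facet that classifies only the given delimiter characters as white space.
struct word_reader : std::ctype<char> {
    word_reader(std::string const &delims) : std::ctype<char>(get_table(delims)) {}
    static std::ctype_base::mask const* get_table(std::string const &delims) {
        // static so the table outlives the facet; note that all instances share it
        static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());

        for (char ch : delims)
            rc[static_cast<unsigned char>(ch)] = std::ctype_base::space;  // avoid a negative index
        return &rc[0];
    }
};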

Then we need to tell the stream to use that locale (well, a locale with that ctype facet), passing the characters we want used as delimiters, and then extract words from the stream:

int main() {
    std::istringstream in("word1, word2. word3,word4");

    // create a ctype facet specifying delimiters, and tell stream to use it:
    in.imbue(std::locale(std::locale(), new word_reader(" ,.\n")));
    std::string word;

    // read words from the stream. Note we just use `>>`, not `std::getline`:
    while (in >> word)
        std::cout << word << "\n";
}

The result is what (I hope) you want: extracting each word without the punctuation we said was "white space".

word1
word2
word3
word4