How do I tokenize a string in C++?

2018-12-31 00:03发布

Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?

30条回答
十年一品温如言
2楼-- · 2018-12-31 00:15

This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:

#include <string>
#include <vector>

void tokenize(std::string str, std::vector<string> &token_v){
    size_t start = str.find_first_not_of(DELIMITER), end=start;

    while (start != std::string::npos){
        // Find next occurence of delimiter
        end = str.find(DELIMITER, start);
        // Push back the token found into vector
        token_v.push_back(str.substr(start, end-start));
        // Skip all occurences of the delimiter to find new start
        start = str.find_first_not_of(DELIMITER, end);
    }
}

Try it out live!

查看更多
春风洒进眼中
3楼-- · 2018-12-31 00:16

If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.

查看更多
时光乱了年华
4楼-- · 2018-12-31 00:16

boost::tokenizer is your friend, but consider making your code portable with reference to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
  wstring s;
  while (getline(wcin, s)) {
    char_separator<wchar_t> sep(L" "); // list of separator characters
    Tok tok(s, sep);
    for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      wcout << *beg << L"\t"; // output (or store in vector)
    }
    wcout << L"\n";
  }
  return 0;
}
查看更多
听够珍惜
5楼-- · 2018-12-31 00:17

Boost has a strong split function: boost::algorithm::split.

Sample program:

#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    auto s = "a,b, c ,,e,f,";
    std::vector<std::string> fields;
    boost::split(fields, s, boost::is_any_of(","));
    for (const auto& field : fields)
        std::cout << "\"" << field << "\"\n";
    return 0;
}

Output:

"a"
"b"
" c "
""
"e"
"f"
""
查看更多
后来的你喜欢了谁
6楼-- · 2018-12-31 00:18

I thought that was what the >> operator on string streams was for:

string word; sin >> word;
查看更多
美炸的是我
7楼-- · 2018-12-31 00:19

Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

Live Example


If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa

Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:

  1. strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
  2. For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
  3. Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on

Both the previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

Live Example

The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.


If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

Live Example

查看更多
登录 后发表回答