How do I tokenize a string in C++?

2018-12-31 00:03发布

Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?

30条回答
梦醉为红颜
2楼-- · 2018-12-31 00:22

I know you asked for a C++ solution, but you might consider this helpful:

Qt

#include <QString>

...

QString str = "The quick brown fox"; 
QStringList results = str.split(" "); 

The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.

See more at Qt documentation

查看更多
明月照影归
3楼-- · 2018-12-31 00:22

Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).

#include <string.h> // for strchr and strlen

/*
 * want_empty_tokens==true  : include empty tokens, like strsep()
 * want_empty_tokens==false : exclude empty tokens, like strtok()
 */
std::vector<std::string> tokenize(const char* src,
                                  char delim,
                                  bool want_empty_tokens)
{
  std::vector<std::string> tokens;

  if (src and *src != '\0') // defensive
    while( true )  {
      const char* d = strchr(src, delim);
      size_t len = (d)? d-src : strlen(src);

      if (len or want_empty_tokens)
        tokens.push_back( std::string(src, len) ); // capture token

      if (d) src += len+1; else break;
    }

  return tokens;
}
查看更多
孤独总比滥情好
4楼-- · 2018-12-31 00:22

There is no direct way to do this. Refer this code project source code to find out how to build a class for this.

查看更多
无色无味的生活
5楼-- · 2018-12-31 00:23

MFC/ATL has a very nice tokenizer. From MSDN:

CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;

resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
   printf("Resulting token: %s\n", resToken);
   resToken= str.Tokenize("% #",curPos);
};

Output

Resulting Token: First
Resulting Token: Second
Resulting Token: Third
查看更多
忆尘夕之涩
6楼-- · 2018-12-31 00:24

Your simple case can easily be built using the std::string::find method. However, take a look at Boost.Tokenizer. It's great. Boost generally has some very cool string tools.

查看更多
唯独是你
7楼-- · 2018-12-31 00:26

Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.

Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:

std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
    std::smatch m{};
    std::vector<std::string> ret{};
    while (std::regex_search (it,end,m,e)) {
        ret.emplace_back(m.str());              
        std::advance(it, m.position() + m.length()); //next start position = match position + match length
    }
    return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){  //comfort version calls flexible version
    return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
    std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
    auto v = split(str);
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    std::cout << "crazy version:" << std::endl;
    v = split(str, std::regex{"[^e]+"});  //using e as delim shows flexibility
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    return 0;
}

If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:

template<bool...> struct BoolSequence{};        //just here to hold bools
template<char...> struct CharSequence{};        //just here to hold chars
template<typename T, char C> struct Contains;   //generic
template<char First, char... Cs, char Match>    //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
    Contains<CharSequence<Cs...>, Match>{};     //strip first and increase index
template<char First, char... Cs>                //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {}; 
template<char Match>                            //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};

template<int I, typename T, typename U> 
struct MakeSequence;                            //generic
template<int I, bool... Bs, typename U> 
struct MakeSequence<I,BoolSequence<Bs...>, U>:  //not last
    MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U> 
struct MakeSequence<0,BoolSequence<Bs...>,U>{   //last  
    using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
    /* could be made constexpr but not yet supported by MSVC */
    static bool isDelim(const char c){
        static const bool table[256] = {Bs...};
        return table[static_cast<int>(c)];
    }   
};
using Delims = CharSequence<'.',',',' ',':','\n'>;  //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;

With that in place making a getNextToken function is easy:

template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
    begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
    auto second = std::find_if(begin,end,Table{});      //find first delim or end
    return std::make_pair(begin,second);
}

Using it is also easy:

int main() {
    std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
    auto it = std::begin(s);
    auto end = std::end(s);
    while(it != std::end(s)){
        auto token = getNextToken(it,end);
        std::cout << std::string(token.first,token.second) << std::endl;
        it = token.second;
    }
    return 0;
}

Here is a live example: http://ideone.com/GKtkLQ

查看更多
登录 后发表回答