C++ string sort like a human being?

2020-06-03 05:00发布

问题:

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do with without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".

Below is a test case that currently fails:

#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>

template <typename T>
struct LexicographicSort {
  inline bool operator() (const T& lhs, const T& rhs) const{
    std::ostringstream s1,s2;
    s1 << toLower(lhs); s2 << toLower(rhs);
    bool less = s1.str() < s2.str();
    //Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
    std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
    return less;
  }

  inline std::string toLower(const std::string& str) const {
    std::string newString("");
    for (std::string::const_iterator charIt = str.begin();
         charIt!=str.end();++charIt) {
          newString.push_back(std::tolower(*charIt));
        }
        return newString;
      }
};


int main(void) {
  const std::string reference[5] = {"ab","B","c1","c2","c10"};
  std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));

  //Insert in reverse order so we know they get sorted
  std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());

  std::cout<<"Items:\n";
  std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
  std::vector<std::string> sortedStrings(strings.begin(), strings.end());
  assert(sortedStrings == referenceStrings);
}

回答1:

Is there any way to do with without writing a mini-parser?

Let someone else do that?

I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.



回答2:

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.

  • Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
  • Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
  • If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
  • If one string has run out of characters, it comes first.
  • Use strcoll to compare alphabetic subsequences within the current locale.
  • Use strtold to compare numeric subsequences within the current locale.
  • Repeat until finished with one or both strings.
  • Break ties with strcmp.

This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.



回答3:

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?

It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.



回答4:

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.

The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'='2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).

From there, sort as normal.