What is wrong with `std::set`?

2019-01-14 02:26发布

问题:

In the other topic I was trying to solve this problem. The problem was to remove duplicate characters from a std::string.

std::string s= "saaangeetha";

Since the order was not important, so I sorted s first, and then used std::unique and finally resized it to get the desired result:

aeghnst

That is correct!


Now I want to do the same, but at the same time I want the order of characters intact. Means, I want this output:

sangeth

So I wrote this:

template<typename T>
struct is_repeated
{
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    s.erase(std::remove_if(s.begin(), s.end(), is_repeated<char>()), s.end()); 
    std::cout << s ;
}

Which gives this output:

saangeth

That is, a is repeated, though other repetitions gone. What is wrong with the code?

Anyway I change my code a bit: (see the comment)

template<typename T>
struct is_repeated
{
    std::set<T> & unique;  //made reference!
    is_repeated(std::set<T> &s) : unique(s) {} //added line!
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    std::set<char> set; //added line!
    s.erase(std::remove_if(s.begin(),s.end(),is_repeated<char>(set)),s.end()); 
    std::cout << s ;
}

Output:

sangeth

Problem gone!

So what is wrong with the first solution?

Also, if I don't make the member variable unique reference type, then the problem doesn't go.

What is wrong with std::set or is_repeated functor? Where exactly is the problem?

I also note that if the is_repeated functor is copied somewhere, then every member of it is also copied. I don't see the problem here!

回答1:

In GCC (libstdc++), remove_if is implemented essentially as

    template<typename It, typename Pred>
    It remove_if(It first, It last, Pred predicate) {
      first = std::find_if(first, last, predicate);
    //                                  ^^^^^^^^^
      if (first == last)
         return first;
      else {
         It result = first;
         ++ result;
         for (; first != last; ++ first) {
           if (!predicate(*first)) {
    //          ^^^^^^^^^
              *result = std::move(*first);
              ++ result;
           }
         }
      }
    }

Note that your predicate is passed by-value to find_if, so the struct, and therefore the set, modified inside find_if will not be propagated back to caller.

Since the first duplicate appears at:

  saaangeetha
//  ^

The initial "sa" will be kept after the find_if call. Meanwhile, the predicate's set is empty (the insertions within find_if are local). Therefore the loop afterwards will keep the 3rd a.

   sa | angeth
// ^^   ^^^^^^
// ||   kept by the loop in remove_if
// ||
// kept by find_if


回答2:

Functors are supposed to be designed in a way where a copy of a functor is identical to the original functor. That is, if you make a copy of one functor and then perform a sequence of operations, the result should be the same no matter which functor you use, or even if you interleave the two functors. This gives the STL implementation the flexibility to copy functors and pass them around as it sees fit.

With your first functor, this claim does not hold because if I copy your functor and then call it, the changes you make to its stored set do not reflect in the original functor, so the copy and the original will perform differently. Similarly, if you take your second functor and make it not store its set by reference, the two copies of the functor will not behave identically.

The reason that your final version of the functor works, though, is because the fact that the set is stored by reference means that any number of copies of tue functor will behave identically to one another.

Hope this helps!



回答3:

Not really an answer, but as another interesting tidbit to consider, this does work, even though it uses the original functor:

#include <set>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

template<typename T>
struct is_repeated {
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    std::remove_copy_if(s.begin(), s.end(), 
                        std::ostream_iterator<char>(std::cout), 
                        is_repeated<char>());
    return 0;
}

Edit: I don't think it affects this behavior, but I've also corrected a minor slip in your functor (operator() should apparently take a parameter of type T, not char).



回答4:

I suppose the problem could lie in that the is_repeated functor is copied somewhere inside the implementation of std::remove_if. If that is the case, the default copy constructor is used and this in turn calls std::set copy constructor. You end up with two is_repeated functors possibly used independently. However as the sets in both of them are distinct objects, they don't see the mutual changes. If you turn the field is_repeated::unique to a reference, then the copied functor still uses the original set which is what you want in this case.



回答5:

Functor classes should be pure functions and have no state of their own. See item 39 in Scott Meyer's Effective STL book for a good explanation on this. But the gist of it is that your functor class may be copied 1 or more times inside the algorithm.



回答6:

The other answers are correct, in that the issue is that the functor that you are using is not copyable safe. In particular, the STL that comes with gcc (4.2) implements std::remove_if as a combination of std::find_if to locate the first element to delete followed by a std::remove_copy_if to complete the operation.

template <typename ForwardIterator, typename Predicate>
std::remove_if( ForwardIterator first, ForwardIterator end, Predicate pred ) {
   first = std::find_if( first, end, pred ); // [1]
   ForwardIterator i = it;
   return first == last? first 
          : std::remove_copy_if( ++i, end, fist, pred ); // [2]
}

The copy in [1] means that the first element found is added to the copy of the functor and that means that the first 'a' will be lost in oblivion. The functor is also copied in [2], and that would be fine if it were not because the original for that copy is an empty functor.



回答7:

Depending on the implementation of remove_if can make copies of your predicate. Either refactor your functor and make it stateless or use Boost.Ref to "for passing references to function templates (algorithms) that would usually take copies of their arguments", like so:

#include <set>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

#include <boost/ref.hpp>
#include <boost/bind.hpp>

template<typename T>
struct is_repeated {
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 

int main() {
    std::string s= "saaangeetha";
    s.erase(std::remove_if(s.begin(), s.end(), boost::bind<bool>(boost::ref(is_repeated<char>()),_1)), s.end());
    std::cout << s;

    return 0;
}