How to remove duplicates from unsorted std::vector

Posted 2019-01-05 02:33

I have an array of integers from which I need to remove duplicates while maintaining the order of the first occurrence of each integer. I can see doing it like this, but I imagine there is a better way that makes better use of STL algorithms. The insertion is out of my control, so I cannot check for duplicates before inserting.

int unsortedRemoveDuplicates(std::vector<int> &numbers) {
    std::set<int> uniqueNumbers;
    std::vector<int>::iterator allItr = numbers.begin();
    std::vector<int>::iterator unique = allItr;
    std::vector<int>::iterator endItr = numbers.end();

    for (; allItr != endItr; ++allItr) {
        const bool isUnique = uniqueNumbers.insert(*allItr).second;

        if (isUnique) {
            *unique = *allItr;
            ++unique;
        }
    }

    const int duplicates = endItr - unique;

    numbers.erase(unique, endItr);
    return duplicates;
}

How can this be done using STL algorithms?

8 answers
Luminary・发光体
#2 · 2019-01-05 03:04

Here is what WilliamKF is searching for. It uses erase, which is fine for lists but not for vectors: erasing from the middle of a vector shifts every later element, so for vectors you should avoid calling erase once per duplicate.

// Makes the elements unique in one pass, without sorting.
// Suited to node-based containers such as std::list; O(n^2) comparisons.
template<class listtype> inline
void uniques(listtype* In)
    {
    typename listtype::iterator it = In->begin();

        while(it!=In->end())
        {
        typename listtype::iterator it2 = it;
        ++it2;
        while(it2!=In->end())
            {
            if ((*it)==(*it2))
                it2 = In->erase(it2);   // erase returns the next valid iterator
            else
                ++it2;
            }
        ++it;
        }
    }

What I have tried for vectors, without using sort, is this:

// Makes a vector unique without sorting: each later duplicate is overwritten
// with a copy of its left neighbour, so std::unique can then drop it.
template<typename T> inline
void vectoruniques(std::vector<T>* In)
    {
        if (In->size() < 2)
            return;   // nothing to do; also avoids end()-1 on an empty vector

        for (typename std::vector<T>::iterator it = In->begin();it<In->end()-1;it++)
        {
            T tmp = *it;
            for (typename std::vector<T>::iterator it2 = it+1;it2<In->end();it2++)
            {
                if (*it2!=*it)
                    tmp = *it2;
                else
                    *it2 = tmp;
            }
        }
        typename std::vector<T>::iterator it = std::unique(In->begin(),In->end());
        In->erase(it, In->end());
    }

Somehow it looks like this works. I have only tested it a little, so maybe somebody can tell whether it really works. This solution doesn't need a greater-than operator. I mean, why use a greater-than operator to search for unique elements? Usage for vectors:

int myints[] = {21,10,20,20,20,30,21,31,20,20,2}; 
std::vector<int> abc(myints , myints+11);
vectoruniques(&abc);
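Since the answer asks whether this really works, one way to gain some confidence is to compare it against a straightforward std::set-based reference on random inputs. The sketch below is only that, a sketch: it repeats vectoruniques with typename added and a guard for short vectors, and the names referenceUniques, agreesOnRandomInputs and exampleCheck are made up for illustration. Passing such a check is evidence, not a proof.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// vectoruniques from the answer, with typename issues avoided via auto and
// an early return so end()-1 is never formed on an empty vector.
template <typename T>
void vectoruniques(std::vector<T>* in) {
    if (in->size() < 2) return;
    for (auto it = in->begin(); it != in->end() - 1; ++it) {
        T tmp = *it;
        for (auto it2 = it + 1; it2 != in->end(); ++it2) {
            if (*it2 != *it)
                tmp = *it2;   // remember the most recent non-matching value
            else
                *it2 = tmp;   // overwrite the duplicate with its left neighbour's value
        }
    }
    in->erase(std::unique(in->begin(), in->end()), in->end());
}

// Hypothetical reference: keep the first occurrence of each value.
template <typename T>
std::vector<T> referenceUniques(const std::vector<T>& in) {
    std::set<T> seen;
    std::vector<T> out;
    for (const T& x : in)
        if (seen.insert(x).second) out.push_back(x);
    return out;
}

// Hypothetical check: true if both versions agree on `rounds` random vectors.
inline bool agreesOnRandomInputs(unsigned seed, int rounds) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> len(0, 40);
    std::uniform_int_distribution<int> val(0, 9);   // small range forces many duplicates
    for (int r = 0; r < rounds; ++r) {
        std::vector<int> v(static_cast<std::size_t>(len(gen)));
        for (int& x : v) x = val(gen);
        const std::vector<int> expected = referenceUniques(v);
        vectoruniques(&v);
        if (v != expected) return false;
    }
    return true;
}

// The example input from the answer.
inline void exampleCheck() {
    int myints[] = {21, 10, 20, 20, 20, 30, 21, 31, 20, 20, 2};
    std::vector<int> abc(myints, myints + 11);
    vectoruniques(&abc);
    assert((abc == std::vector<int>{21, 10, 20, 30, 31, 2}));
}
```

The value range is kept small on purpose so that most inputs contain many duplicates, including adjacent runs and empty vectors.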
闹够了就滚
#3 · 2019-01-05 03:05

Here's something that handles POD and non-POD types with move support. Uses default operator== or a custom equality predicate. Does not require sorting/operator<, key generation, or a separate set. No idea if this is more efficient than the other methods described above.

// Pred instead of _Pr: names starting with an underscore and a capital letter are reserved
template <typename Cnt, typename Pred = std::equal_to<typename Cnt::value_type>>
void remove_duplicates( Cnt& cnt, Pred cmp = Pred() )
{
    Cnt result;
    result.reserve( std::size( cnt ) );  // or cnt.size() if compiler doesn't support std::size()

    std::copy_if( 
        std::make_move_iterator( std::begin( cnt ) )
        , std::make_move_iterator( std::end( cnt ) )
        , std::back_inserter( result )
        , [&]( const typename Cnt::value_type& what ) 
        { 
            return std::find_if( 
                std::begin( result )
                , std::end( result )
                , [&]( const typename Cnt::value_type& existing ) { return cmp( what, existing ); }
            ) == std::end( result );
        }
    );  // copy_if

    cnt = std::move( result );  // place result in cnt param
}   // remove_duplicates

Usage/tests:

{
    std::vector<int> ints{ 0,1,1,2,3,4 };
    remove_duplicates( ints );
    assert( ints.size() == 5 );
}

{
    struct data 
    { 
        std::string foo; 
        bool operator==( const data& rhs ) const { return this->foo == rhs.foo; }
    };

    std::vector<data>
        mydata{ { "hello" }, {"hello"}, {"world"} }
        , mydata2 = mydata
        ;

    // use operator==
    remove_duplicates( mydata );
    assert( mydata.size() == 2 );

    // use custom predicate
    remove_duplicates( mydata2, []( const data& left, const data& right ) { return left.foo == right.foo; } );
    assert( mydata2.size() == 2 );

}
做个烂人
#4 · 2019-01-05 03:05

Here is a C++11 generic version that works with iterators and doesn't allocate additional storage. It has the disadvantage of being O(n^2) in the worst case, but it is likely faster for smaller input sizes.

template<typename Iter>
Iter removeDuplicates(Iter begin,Iter end)
{
    auto it = begin;
    while(it != end)
    {
        auto next = std::next(it);
        if(next == end)
        {
            break;
        }
        end = std::remove(next,end,*it);
        it = next;
    }

    return end;
}

....

vec.erase(removeDuplicates(vec.begin(),vec.end()),vec.end());

Sample Code: http://cpp.sh/5kg5n
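For readers who want to try this without following the link, here is a minimal self-contained sketch. It repeats removeDuplicates from this answer and wraps the erase call in a helper; the names dedupeInPlace and exampleUsage are hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// removeDuplicates as given in the answer: for each element, std::remove
// compacts away every later element equal to it and shrinks `end`.
template <typename Iter>
Iter removeDuplicates(Iter begin, Iter end) {
    auto it = begin;
    while (it != end) {
        auto next = std::next(it);
        if (next == end) {
            break;
        }
        end = std::remove(next, end, *it);
        it = next;
    }
    return end;
}

// Hypothetical convenience wrapper: erase the leftover tail of the vector.
template <typename T>
void dedupeInPlace(std::vector<T>& vec) {
    vec.erase(removeDuplicates(vec.begin(), vec.end()), vec.end());
}

// Usage sketch: first occurrences are kept in their original order.
inline void exampleUsage() {
    std::vector<int> v{6, 5, 5, 8, 5, 8};
    dedupeInPlace(v);
    assert((v == std::vector<int>{6, 5, 8}));
}
```

Note that erase is called as a member function of the vector; the free function std::erase added in C++20 takes a container and a value, not an iterator range.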

我命由我不由天
#5 · 2019-01-05 03:11

The naive way is to use std::set as everyone tells you. It's overkill and has poor cache locality (slow).
The smart* way is to use std::vector appropriately (make sure to see footnote at bottom):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>
struct target_less
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a < *b; }
};
struct target_equal
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a == *b; }
};
template<class It> It uniquify(It begin, It const end)
{
    std::vector<It> v;
    v.reserve(static_cast<size_t>(std::distance(begin, end)));
    for (It i = begin; i != end; ++i)
    { v.push_back(i); }
    std::stable_sort(v.begin(), v.end(), target_less());  // stable, so the first occurrence leads each group of equal values
    v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
    std::sort(v.begin(), v.end());
    size_t j = 0;
    for (It i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    return begin;
}

Then you can use it like:

int main()
{
    std::vector<int> v;
    v.push_back(6);
    v.push_back(5);
    v.push_back(5);
    v.push_back(8);
    v.push_back(5);
    v.push_back(8);
    v.erase(uniquify(v.begin(), v.end()), v.end());
}

*Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.
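The answer does not spell out the steps, so here is one possible reading of the same sort, unique, sort-back idea, written with indices instead of iterators. This is an illustrative sketch rather than the answer's code: the name uniquifyByIndex is hypothetical, it assumes C++11, and it uses std::stable_sort so that the earliest occurrence of each value is the one that survives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical index-based variant of the sort / unique / sort-back technique.
template <typename T>
void uniquifyByIndex(std::vector<T>& values) {
    std::vector<std::size_t> idx(values.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // 1. Group equal values together. stable_sort keeps indices of equal
    //    values in ascending order, so the earliest occurrence comes first.
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    // 2. Keep one index per group; std::unique keeps the first of each run.
    idx.erase(std::unique(idx.begin(), idx.end(),
                          [&](std::size_t a, std::size_t b) { return values[a] == values[b]; }),
              idx.end());

    // 3. Put the surviving indices back into their original order.
    std::sort(idx.begin(), idx.end());

    // 4. Compact survivors to the front. idx is strictly increasing, so
    //    idx[k] >= k and no element is overwritten before it is moved.
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (idx[k] != k) values[k] = std::move(values[idx[k]]);
    }
    values.erase(values.begin() + static_cast<std::ptrdiff_t>(idx.size()), values.end());
}

// Usage sketch with the same input as the main() above.
inline void exampleUsage() {
    std::vector<int> v{6, 5, 5, 8, 5, 8};
    uniquifyByIndex(v);
    assert((v == std::vector<int>{6, 5, 8}));
}
```

The extra storage is one index per element, and the cost is dominated by the two sorts, so roughly O(n log n) comparisons.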

我欲成王,谁敢阻挡
#6 · 2019-01-05 03:17

Sounds like a job for std::copy_if. Define a predicate that keeps track of elements that already have been processed and return false if they have.

If you don't have C++11 support, you can use the clumsily named std::remove_copy_if and invert the logic.

This is an untested example:

template <typename T>
struct NotDuplicate {
  bool operator()(const T& element) {
    return s_.insert(element).second; // true if element was not already in s_
  }
 private:
  std::set<T> s_;
};

Then

std::vector<int> uniqueNumbers;
NotDuplicate<int> pred;
std::copy_if(numbers.begin(), numbers.end(), 
             std::back_inserter(uniqueNumbers),
             std::ref(pred));

Here std::ref is used to avoid potential problems with the algorithm internally copying what is a stateful functor, although std::copy_if does not place any requirements on side-effects of the functor being applied.
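The pre-C++11 route mentioned above, std::remove_copy_if with the logic inverted, could look roughly like the sketch below. It is likewise only lightly checked. The names IsDuplicate, uniqueCopy and exampleUsage are hypothetical, and because std::ref is not available before C++11, the predicate holds a pointer to a set owned by the caller so that any copies the algorithm makes share the same state.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical predicate with inverted logic: true when the element has
// already been seen, which tells remove_copy_if to skip it.
template <typename T>
struct IsDuplicate {
  explicit IsDuplicate(std::set<T>& seen) : seen_(&seen) {}
  bool operator()(const T& element) const {
    return !seen_->insert(element).second;
  }
 private:
  std::set<T>* seen_;  // non-owning; copies of the functor share one set
};

// Hypothetical helper: copy only the first occurrence of each element.
template <typename T>
std::vector<T> uniqueCopy(const std::vector<T>& numbers) {
  std::set<T> seen;
  std::vector<T> uniqueNumbers;
  std::remove_copy_if(numbers.begin(), numbers.end(),
                      std::back_inserter(uniqueNumbers),
                      IsDuplicate<T>(seen));
  return uniqueNumbers;
}

// Usage sketch, written without C++11 features.
inline void exampleUsage() {
  int raw[] = {6, 5, 5, 8, 5, 8};
  std::vector<int> numbers(raw, raw + 6);
  std::vector<int> result = uniqueCopy(numbers);
  assert(result.size() == 3);
  assert(result[0] == 6 && result[1] == 5 && result[2] == 8);
}
```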

淡お忘
#7 · 2019-01-05 03:23

Fast and simple, C++11:

template<typename T>
size_t RemoveDuplicatesKeepOrder(std::vector<T>& vec)
{
    std::set<T> seen;

    auto newEnd = std::remove_if(vec.begin(), vec.end(), [&seen](const T& value)
    {
        if (seen.find(value) != std::end(seen))
            return true;

        seen.insert(value);
        return false;
    });

    vec.erase(newEnd, vec.end());

    return vec.size();
}