I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.
I currently have the code below, but it doesn't work:
vec.erase(
std::unique(vec.begin(), vec.end()),
vec.end());
std::sort(vec.begin(), vec.end());
How can I correctly do this?
Additionally, is it faster to erase the duplicates first (as in the code above) or to perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after std::unique is executed?
Or is there another (perhaps more efficient) way to do all this?
The standard approach suggested by Nate Kohl, just using vector with sort + unique, doesn't work for a vector of pointers.
Look carefully at this example on cplusplus.com. In their example, the "so called duplicates" moved to the end are actually shown as ? (unspecified values), because those "so called duplicates" are SOMETIMES extra copies of elements that remain earlier in the vector, and SOMETIMES elements that were in the original vector are missing entirely.
A problem occurs when using std::unique() on a vector of pointers to objects: you can get memory leaks, reads of freed data from the heap, and duplicate frees, which cause segmentation faults. The elements std::unique() leaves at the end of the range have unspecified values, so you no longer know which pointers are safe to delete; a sketch of the failure mode is below.
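To make the failure concrete, here is a minimal sketch, assuming a vector of owning raw Foo* pointers deduplicated by pointed-to value (Foo, the comparison helpers, and both function names are assumptions for illustration; this is not the ptgi::unique implementation):

#include <algorithm>
#include <vector>

struct Foo { int value; };

// Order and compare by the pointed-to values, not the pointer addresses.
bool less_by_value(const Foo* a, const Foo* b)  { return a->value < b->value; }
bool equal_by_value(const Foo* a, const Foo* b) { return a->value == b->value; }

// BROKEN: after std::unique, the elements in [mid, vec.end()) have
// unspecified values, so the tail may hold a second copy of a pointer
// that was kept (double free) while some original pointer is gone (leak).
void broken_dedupe(std::vector<Foo*>& vec)
{
    std::sort(vec.begin(), vec.end(), less_by_value);
    auto mid = std::unique(vec.begin(), vec.end(), equal_by_value);
    for (auto it = mid; it != vec.end(); ++it)
        delete *it;                       // may double-free or miss objects
    vec.erase(mid, vec.end());
}

// One safe pattern: decide keep/delete while every pointer is still intact.
void safe_dedupe(std::vector<Foo*>& vec)
{
    std::sort(vec.begin(), vec.end(), less_by_value);
    std::vector<Foo*> kept;
    for (Foo* p : vec) {
        if (kept.empty() || kept.back()->value != p->value)
            kept.push_back(p);            // first occurrence: keep it
        else
            delete p;                     // duplicate: release it immediately
    }
    vec.swap(kept);
}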
Here's my solution to the problem: replace std::unique() with ptgi::unique(). See the file ptgi_unique.hpp below:
And here is the UNIT Test program that I used to test it:
If you are looking for performance and using std::vector, I recommend the one that this documentation link provides.

I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new, more efficient method: using an unordered_set.

Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end, since unordered_set is, well, unordered. :)

I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element before checking its value to determine whether it should actually be inserted (in the Visual Studio implementation, at least).

Here are the 5 methods (sketched in code after the list):
f1: Just using vector, sort + unique
f2: Convert to set (using a constructor)
f3: Convert to set (manually)
f4: Convert to unordered_set (using a constructor)
f5: Convert to unordered_set (manually)

I did the test with a vector of 100,000,000 ints chosen randomly in the ranges [1,10], [1,1000], and [1,100000].
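For reference, here is a minimal sketch of what the five methods could look like; the original benchmark code isn't reproduced here, so these bodies are a reconstruction under stated assumptions (element type int, result left in the vector):

#include <algorithm>
#include <set>
#include <unordered_set>
#include <vector>

// f1: sort + unique, in place
void f1(std::vector<int>& vec)
{
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

// f2: round-trip through a set, using the range constructor
void f2(std::vector<int>& vec)
{
    std::set<int> s(vec.begin(), vec.end());
    vec.assign(s.begin(), s.end());           // already sorted and unique
}

// f3: round-trip through a set, inserting manually
void f3(std::vector<int>& vec)
{
    std::set<int> s;
    for (int x : vec) s.insert(x);
    vec.assign(s.begin(), s.end());
}

// f4: round-trip through an unordered_set, using the range constructor
void f4(std::vector<int>& vec)
{
    std::unordered_set<int> s(vec.begin(), vec.end());
    vec.assign(s.begin(), s.end());
    std::sort(vec.begin(), vec.end());        // unordered_set is unordered
}

// f5: round-trip through an unordered_set, inserting manually
void f5(std::vector<int>& vec)
{
    std::unordered_set<int> s;
    for (int x : vec) s.insert(x);
    vec.assign(s.begin(), s.end());
    std::sort(vec.begin(), vec.end());
}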
The results (in seconds, smaller is better):
std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend. std::unique is defined to be stable, so the vector will still be sorted after running unique on it.

I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:
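A minimal sketch of that alternative, assuming int elements (the function name is illustrative):

#include <set>
#include <vector>

void dedupe_with_set(const std::vector<int>& vec)
{
    // The set sorts and dedupes in one step; iterate s wherever you
    // would have iterated the cleaned-up vector.
    std::set<int> s(vec.begin(), vec.end());
}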
Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.
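Applied to the question's snippet, the fix is just to swap the two steps; a minimal sketch:

#include <algorithm>
#include <vector>

void sort_and_dedupe(std::vector<int>& vec)
{
    // Sort first so that equal elements become neighbours...
    std::sort(vec.begin(), vec.end());
    // ...then unique shifts the duplicates to the tail, and erase drops them.
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}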
About alexK7's benchmarks: I tried them and got similar results, but when the range of values is 1 million, the cases using std::sort (f1) and std::unordered_set (f5) produce similar times. When the range of values is 10 million, f1 is faster than f5.
If the range of values is limited and the values are unsigned int, it is possible to use a std::vector whose size corresponds to the given range. The code is sketched below:
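A minimal sketch of the idea, assuming the values fall in [0, range_size] (the function name and the range_size parameter are assumptions for illustration):

#include <vector>

// Deduplicate and sort in O(n + range_size) time using presence flags.
void unique_and_sort(std::vector<unsigned>& vec, unsigned range_size)
{
    std::vector<bool> seen(range_size + 1, false);
    for (unsigned v : vec)
        seen[v] = true;                  // mark every value that occurs

    vec.clear();
    for (unsigned v = 0; v <= range_size; ++v)
        if (seen[v])
            vec.push_back(v);            // values come back in sorted order
}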