I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert()
work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
cppreference says:
I checked, and both
libc++
andlibstdc++
do use red-black trees forstd::set
.std::unordered_set
was implemented with a hash table inlibc++
and I presume the same forlibstdc++
but didn't check.Edit: Apparently my word is not good enough.
libc++
: 1 2libstdc++
: 1As KTC said, how
std::set
is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see Wikipedia's entry on hash tables for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of
Node
data structure.You could implement a binary search tree by first defining a
Node
struct:Then, you could define a root of the tree with another
Node *rootNode;
The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.
In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.
Step debug into
g++
6.4 stdlibc++ sourceDid you know that on Ubuntu's 16.04 default
g++-6
package or a GCC 6.4 build from source you can step into the C++ library without any further setup?By doing that we easily conclude that a Red-black tree used.
This makes sense, since
std::set
can be traversed in order, which would not be efficient in if a hash map were used.a.cpp:
Compile and debug:
Now, if you step into
s.insert(1)
you immediately reach/usr/include/c++/6/bits/stl_set.h
:which clearly just forwards to
_M_t._M_insert_unique
.So we open the source file in vim and find the definition of
_M_t
:So
_M_t
is of type_Rep_type
and_Rep_type
is a_Rb_tree
.OK, now that is enough evidence for me. If you don't believe that
_Rb_tree
is a Black-red tree, step a bit further and read the algorithmunordered_set
uses hash tableSame procedure, but replace
set
withunordered_set
on the code.This makes sense, since
std::unordered_set
cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.Stepping into
insert
leads to/usr/include/c++/6/bits/unordered_set.h
:So we open the source file in
vim
and search for_M_h
:So hash table it is.
std::map
andstd::unordered_map
Analogous for
std::set
vsstd:unordered_set
: What data structure is inside std::map in C++?Performance characteristics
You could also infer the data structure used by timing them.
We clearly see for:
std::set
, a logarithmic insertion timestd::unordered_set
, a more complex pattern hashmap pattern:on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the
std::map
, except for very small map sizesSeveral strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.
Graph generated with:
Heap vs BST analysis at: Heap vs Binary Search Tree (BST)
As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.) though but there may be other implementations that exist.
It depends on the underlying implementation of your set. If it is implemented as a binary tree, Wikipedia has a sample recursive implementation for the insert() function. You may want to check it out.
If it is implemented as a tree, then it traverses the tree and check each element. However, sets do not allow duplicate elements to be stored though. If you want a set that allows duplicate elements, then you need a multiset.
You may be referring to a hash_set, where the set is implemented using hash tables. You'll need to provide a hash function to know which location to store your element. This implementation is ideal when you want to be able to search for an element quickly. However, if it is important for your elements to be stored in particular order, then the tree implementation is more appropriate since you can traverse it preorder, inorder or postorder.
How a particular container is implemented in C++ is entirely implementation dependent. All that is required is for the result to meet the requirements set out in the standard, such as complexity requirement for the various methods, iterators requirements etc.