For many problems, I see the recommended solution is to use a union-find data structure. I tried to read about it and to think about how it is implemented (in C++). My current understanding is that it is nothing but a list of sets, so that finding which set an element belongs to requires O(n log n) operations. And when we have to perform a union, we find the two sets that need to be merged and do a set_union on them. This doesn't look terribly efficient to me. Is my understanding of this data structure correct, or am I missing something?
Answer 1:
This is quite a late reply, but this has probably not been answered well elsewhere on Stack Overflow, and since this is the top result for someone searching for union-find, here is a detailed solution.
Find-union is a very fast operation, running in near-constant time. It follows Jeremie's insights about path compression and tracking set sizes. Path compression is performed on each find operation itself, so find takes amortized lg*(n) time. lg* is like the inverse Ackermann function: it grows so slowly that it is rarely beyond 5 (at least until n < 2^65535). Union/merge of sets is performed lazily, by just pointing one root to another, specifically the smaller set's root to the larger set's root, which completes in constant time.
Refer to the code below, from https://github.com/kartikkukreja/blog-codes/blob/master/src/Union%20Find%20%28Disjoint%20Set%29%20Data%20Structure.cpp
class UF {
    int *id;   // id[p] is the parent of p; a root satisfies id[p] == p
    int *sz;   // sz[root] is the number of elements in that root's tree
    int cnt;   // current number of disjoint sets
public:
    // Create an empty union-find data structure with N isolated sets.
    UF(int N) {
        cnt = N;
        id = new int[N];
        sz = new int[N];
        for (int i = 0; i < N; i++) {
            id[i] = i;
            sz[i] = 1;
        }
    }
    ~UF() {
        delete[] id;
        delete[] sz;
    }
    // Return the id of the component corresponding to object p.
    int find(int p) {
        int root = p;
        while (root != id[root])
            root = id[root];
        // Path compression: point every node on the path directly at the root.
        while (p != root) {
            int newp = id[p];
            id[p] = root;
            p = newp;
        }
        return root;
    }
    // Replace the sets containing x and y with their union.
    void merge(int x, int y) {
        int i = find(x);
        int j = find(y);
        if (i == j) return;
        // Union by size: make the smaller root point to the larger one.
        if (sz[i] < sz[j]) {
            id[i] = j;
            sz[j] += sz[i];
        } else {
            id[j] = i;
            sz[i] += sz[j];
        }
        cnt--;
    }
    // Are objects x and y in the same set?
    bool connected(int x, int y) { return find(x) == find(y); }
    // Return the number of disjoint sets.
    int count() { return cnt; }
};
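For illustration, here is a minimal usage sketch, compiled together with the UF class above. This is my addition, not part of the linked file, and the element indices are made up:

#include <iostream>

// Hypothetical usage example for the UF class above.
int main() {
    UF uf(10);                  // ten isolated elements, 0..9
    uf.merge(1, 2);
    uf.merge(2, 5);
    std::cout << uf.connected(1, 5) << '\n';  // 1: joined via element 2
    std::cout << uf.connected(1, 7) << '\n';  // 0: never merged
    std::cout << uf.count() << '\n';          // 8 sets remain after two merges
    return 0;
}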
Answer 2:
The data structure can be represented as a tree with its branches reversed: instead of pointing down, the branches point upward, linking each child to its parent.
If I remember correctly, it can be shown (fairly easily):

- that path compression (whenever you look up the "parent" of a set A, you "compress" the path so that each future lookup returns the parent in O(1) time) leads to O(log n) amortized complexity per call;
- that balancing (you keep approximate track of the number of children each set has, and when you have to "unite" two sets, you make the one with fewer children a child of the one with more) also leads to O(log n) complexity per call; both ideas are sketched below.
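As an illustration, here is a minimal parent-pointer sketch combining the two optimizations. This is my own reading of what the answer describes, not code from the answer; the names DisjointSets and unite are invented:

#include <utility>
#include <vector>

// My illustration of the upward-pointing tree described above; names are invented.
struct DisjointSets {
    std::vector<int> parent;  // parent[x] == x means x is a root
    std::vector<int> size;    // size[root] counts the nodes in that tree

    explicit DisjointSets(int n) : parent(n), size(n, 1) {
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    // Path compression: every node visited is re-pointed directly at the root.
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    // Balancing: hang the tree with fewer nodes under the root of the larger one.
    void unite(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a;
        size[a] += size[b];
    }
};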
A more involved proof shows that when you combine both optimizations, you obtain an amortized complexity that is the inverse Ackermann function, written α(n); this analysis was Tarjan's main contribution for this structure.
It was later shown, I believe, that for some specific usage patterns, this complexity is actually constant (though for all practical purposes the inverse Ackermann function is about 4). According to the Wikipedia page on union-find, in 1989 the amortized cost per operation of any equivalent data structure was shown to be Ω(α(n)), proving that the current implementation is asymptotically optimal.
Answer 3:
A proper union-find data structure uses path compression during every find. This amortizes the cost, and each operation then takes time proportional to the inverse Ackermann function, which makes it essentially constant (but not quite).
If you are implementing it from scratch, then I would suggest using a tree-based approach.
Answer 4:
A simple union-find structure keeps an array mapping each element to its set, which makes finding the set constant time; updating the entries is amortized O(log n) time per element, and concatenating the member lists is constant time. It is not as quick as the approaches above, but it is trivial to program and more than good enough to improve the big-O running time of, say, Kruskal's minimum spanning tree algorithm.
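Here is a sketch of that array-based variant as I read it. It is my illustration, not the answer's code: the names QuickFind, setOf, and members are invented, and where the answer's constant-time concatenation would splice linked lists, this sketch appends to vectors for simplicity:

#include <utility>
#include <vector>

// My sketch of the element -> set array described above; names are invented.
struct QuickFind {
    std::vector<int> setOf;                 // element -> id of its set
    std::vector<std::vector<int>> members;  // set id -> elements currently in it

    explicit QuickFind(int n) : setOf(n), members(n) {
        for (int i = 0; i < n; i++) {
            setOf[i] = i;
            members[i] = {i};
        }
    }

    int find(int e) const { return setOf[e]; }  // constant-time lookup

    void unite(int a, int b) {
        int sa = setOf[a], sb = setOf[b];
        if (sa == sb) return;
        // Relabel the smaller set: each element lands in a set at least twice
        // as large as before, so it is relabeled at most O(log n) times overall.
        if (members[sa].size() < members[sb].size()) std::swap(sa, sb);
        for (int e : members[sb]) {
            setOf[e] = sa;
            members[sa].push_back(e);
        }
        members[sb].clear();
    }
};

This is already enough for Kruskal's algorithm: find tells you in O(1) whether an edge's endpoints are connected, and each unite is cheap when amortized over the whole run.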