What would be the best way to implement approximate Disjoint Sets using SQL?
Details
I have a table of edges, stored as a two-column table of [vertex_a, vertex_b]
.
I need a table of distinct sets, stored as [vertex, set_id]
with one row per vertex, labeling each vertex with a disjoint set_id.
Constraints
- Must be a purely SQL implementation. It can leverage Postgres-specific functions, but pure ANSI SQL highly preferred.
- The result can be approximate- it's acceptable to label a few sets as disjoint when they're actually connected. It's even better if the approximation bounds can be adjusted- by increasing iterations for example.
- Libraries are out (no Boost, Numpy, Scipy). Must be SQL.
- Most sets will contain 1 to 3 vertices. Very few large sets, expected max to be 10 vertices.
Related
- Related to: Implementing Disjoint Sets (Union Find) in C++
- This would be an approximate implementation of Disjoint-set (Union Find) - Wikipedia
This pure sql code grouped about 35000 records in 5 minutes (8 cores/32 gb ram). Enjoy.
I'm actually working on the same problem. Unfortunately, I don't think a very efficient solution can be found - at least not easily using just SQL. Just remove the 'distinct' and the self-eliminating query to observe how large the working sets become. That said, the following solution will work.
But again, this is completely impractical on larger data sets; any other solution would require iteration.