Consider the following lists of integers:
#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------
The task is to merge all lists that share at least one common element, so the result for the part shown above would be:
#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------
What is the most efficient way to do this on large data (where the elements are just numbers)? Is a tree structure something to think about? I currently do the job by converting the lists to sets and iterating over them for intersections, but it is slow! Furthermore, I have a feeling it is so elementary! In addition, the implementation lacks something (unknown), because some lists sometimes remain unmerged! Having said that, if you propose a self-implementation, please be generous and provide simple sample code [apparently Python is my favorite :)] or pseudo-code.
Update 1:
Here is the code I was using:
#--------------------------------------
lsts = [[0, 1, 3],
        [1, 0, 3, 4, 5, 10, 11],
        [2, 8],
        [3, 1, 0, 16]]
#--------------------------------------
The function is (buggy!!):
#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i + 1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                # Merging grows sts[i], but sets already skipped at a
                # smaller j are never re-checked against the grown set --
                # this is why some lists remain unmerged.
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else:
                j += 1  # ---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------
The result is:
#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------
Update 2:
In my experience, the code given by Niklas Baumstark below turned out to be a bit faster for the simple cases. I have not tested the method given by "Hooked" yet, since it is a completely different approach (by the way, it seems interesting).
Devising a testing procedure that guarantees the results for all of these could be really hard, or even impossible. The real data set I will use is so large and complex that it is impossible to trace any error just by repeating the run. That is, I need to be 100% confident in the reliability of the method before pushing it into its place within a large code as a module. For now, Niklas's method is simply faster, and the answer for simple sets is of course correct.
However, how can I be sure that it works well for a really large data set, given that I will not be able to trace the errors visually?
Update 3: Note that for this problem the reliability of the method is much more important than its speed. I will hopefully be able to translate the Python code to Fortran for maximum performance in the end.
Update 4:
There are many interesting points in this post, and many generously given answers and constructive comments. I would recommend reading all of them thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.
Firstly, I'm not exactly sure the benchmarks are fair. I added the following kind of code to the start of my function to count the distinct values in the test data:
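(A minimal sketch of such a count; my reconstruction, assuming the benchmark input is in a variable named lsts.)
#--------------------------------------
# Hypothetical reconstruction: count the distinct values across all lists.
distinct = set()
for lst in lsts:
    distinct.update(lst)
print(len(distinct))  # 88 for the benchmark data discussed below
#--------------------------------------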
This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values (of course I don't know where your data comes from, so I can't make assumptions).
Because duplicates are more common here, the sets are less likely to be disjoint. This makes the set.isdisjoint() method much faster, because it only needs a few tests to find out that two sets aren't disjoint.
Having said all that, I do believe the methods that use isdisjoint are the fastest anyway; I'm just saying that instead of being 20x faster, maybe they should only be 10x faster than the other methods under different benchmark testing.
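For reference, a minimal sketch of this family of isdisjoint-based merging (the general shape only, not the exact code from any answer):
#--------------------------------------
def merge_via_isdisjoint(lists):
    # Work on sets; drop empty lists up front.
    sets = [set(lst) for lst in lists if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for s in rest:
                if s.isdisjoint(common):
                    sets.append(s)  # keep it for a later pass
                else:
                    common |= s     # absorb the overlapping set
                    merged = True
            results.append(common)
        sets = results
    return sets
#--------------------------------------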
Anyway, I thought I would try a slightly different technique to solve this: order everything first. However, the merge sorting turned out to be too slow; under the benchmarking this method is about 20x slower than the two fastest methods.
Here's my answer. I haven't checked it against today's batch of answers.
The intersection-based algorithms are O(N^2), since they check each new set against all of the existing ones, so I used an approach that indexes each number and runs in close to O(N) (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have a lot of work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster; try test data with more than 50 distinct bins and it is enormously faster.
(Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run).
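A minimal sketch of this indexing idea (my assumption of the general shape, not the exact code): keep a dictionary from each number to the bin it currently lives in, and whenever a new list touches existing bins, merge them all into one.
#--------------------------------------
def merge_indexed(lists):
    bin_of = {}  # number -> the set (bin) it currently belongs to
    bins = []
    for lst in lists:
        if not lst:
            continue
        # Distinct existing bins touched by this list (dedup by identity).
        touched = {id(bin_of[x]): bin_of[x] for x in lst if x in bin_of}
        merged = set(lst)
        for b in touched.values():
            merged |= b
            b.clear()  # mark the old bin as dead
        bins.append(merged)
        for x in merged:
            bin_of[x] = merged
    return [sorted(b) for b in bins if b]  # skip the dead (emptied) bins
#--------------------------------------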
This can be solved in O(n) by using the union-find algorithm. Given the first two rows of your data, edges to use in the union-find are the following pairs: (0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
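A minimal sketch of that idea (a plain path-compressing union-find over the element values; linking adjacent elements of each list generates exactly the kind of pairs listed above):
#--------------------------------------
def merge_union_find(lists):
    parent = {}

    def find(x):
        # Find the root of x, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for lst in lists:
        for x in lst:
            parent.setdefault(x, x)
        # Linking adjacent elements is enough to connect the whole list.
        for a, b in zip(lst, lst[1:]):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    # Collect the elements by their root.
    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())
#--------------------------------------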
EDIT: OK, the other question has been closed, so I'm posting here.
Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.
Explanation
We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.
Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.
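A minimal sketch of what that code looks like (here itertools.pairwise, Python 3.10+, stands in for the pairs helper, which is assumed to yield adjacent pairs):
#--------------------------------------
import networkx as nx
from itertools import pairwise  # plays the role of the `pairs` function

def merge_graph(lists):
    g = nx.Graph()
    for lst in lists:
        g.add_nodes_from(lst)            # keeps single-element lists as components
        g.add_edges_from(pairwise(lst))  # adjacent edges suffice for connectedness
    return [sorted(c) for c in nx.connected_components(g)]
#--------------------------------------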
This is slower than the solution offered by Niklas (I got 3.9 s on test.txt instead of 0.5 s for his solution), but it yields the same result and might be easier to implement in e.g. Fortran, since it doesn't use sets, only sorting of the total number of elements followed by a single run through all of them.
It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged. A sketch of this idea follows.
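(A minimal sketch of such a sort-and-sweep approach; my assumption of the shape: flatten to (value, list-id) pairs, sort them, and union the ids of equal neighbouring values.)
#--------------------------------------
def merge_by_sorting(lists):
    # Union-find over list ids, so empty lists keep their own id.
    parent = list(range(len(lists)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Flatten to (value, list_id) pairs and sort by value.
    pairs = sorted((x, i) for i, lst in enumerate(lists) for x in lst)

    # Single sweep: equal neighbouring values mean their lists must merge.
    for (a, i), (b, j) in zip(pairs, pairs[1:]):
        if a == b:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    # Group the list ids by their root.
    groups = {}
    for i in range(len(lists)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
#--------------------------------------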
Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks, among other things, that the result sets are pairwise disjoint, that no elements are lost, and that every input list ends up inside a single result set.
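A minimal sketch of such a checker (these are necessary conditions only; an over-merged result would still pass them):
#--------------------------------------
def check_merge(lists, result):
    result_sets = [set(r) for r in result]
    # 1. Result sets must be pairwise disjoint.
    for i, a in enumerate(result_sets):
        for b in result_sets[i + 1:]:
            assert a.isdisjoint(b), "result sets overlap"
    # 2. No elements may be lost or invented.
    assert set().union(*result) == set().union(*lists), "element mismatch"
    # 3. Every non-empty input list must land inside a single result set.
    for lst in lists:
        if lst:
            assert any(set(lst) <= r for r in result_sets), "input list was split"
#--------------------------------------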