Python: simple list merging based on intersections

2019-01-01 15:59发布

Consider there are some lists of integers as:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is to merge lists having at least one common element. So the results only for the given part will be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (elements are just numbers)? Is tree structure something to think about? I do the job now by converting lists to sets and iterating for intersections, but it is slow! Furthermore I have a feeling that is so-elementary! In addition, the implementation lacks something (unknown) because some lists remain unmerged sometime! Having said that, if you were proposing self-implementation please be generous and provide a simple sample code [apparently Python is my favoriate :)] or pesudo-code.
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]];
#--------------------------------------

The function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
>>> [0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: To my experience the code given by Niklas Baumstark below showed to be a bit faster for the simple cases. Not tested the method given by "Hooked" yet, since it is completely different approach (by the way it seems interesting). The testing procedure for all of these could be really hard or impossible to be ensured of the results. The real data set I will use is so large and complex, so it is impossible to trace any error just by repeating. That is I need to be 100% satisfied of the reliability of the method before pushing it in its place within a large code as a module. Simply for now Niklas's method is faster and the answer for simple sets is correct of course.
However how can I be sure that it works well for real large data set? Since I will not be able to trace the errors visually!

Update 3: Note that reliability of the method is much more important than speed for this problem. I will be hopefully able to translate the Python code to Fortran for the maximum performance finally.

Update 4:
There are many interesting points in this post and generously given answers, constructive comments. I would recommend reading all thoroughly. Please accept my appreciation for the development of the question, amazing answers and constructive comments and discussion.

15条回答
步步皆殇っ
2楼-- · 2019-01-01 16:18

Just for fun...

def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0],pop(i)) for i in xrange(len(sets)-1,0,-1) if not isd(sets[0],sets[i])]:
            results.append(pop(0))
    return results

and my rewrite of the best answer

def merge(lsts):
  sets = map(set,lsts)
  results = []
  while sets:
    first, rest = sets[0], sets[1:]
    merged = False
    sets = []
    for s in rest:
      if s and s.isdisjoint(first):
        sets.append(s)
      else:
        first |= s
        merged = True
    if merged: sets.append(first)
    else: results.append(first)
  return results
查看更多
看风景的人
3楼-- · 2019-01-01 16:21

Using Matrix Manipulations

Let me preface this answer with the following comment:

THIS IS THE WRONG WAY TO DO THIS. IT IS PRONE TO NUMERICAL INSTABILITY AND IS MUCH SLOWER THAN THE OTHER METHODS PRESENTED, USE AT YOUR OWN RISK.

That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your list as a flow from rows to columns. If two rows share a common value there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when they there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:

from numpy import where, newaxis
from scipy import linalg, array, zeros

X = [[0,1,3],[2],[3,1]]

We need to convert the data into a flow graph. If row i flows into value j we put it in the matrix. Here we have 3 rows and 4 unique values:

A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val,i] = 1

In general, you'll need to change the 4 to capture the number of unique values you have. If the set is a list of integers starting from 0 as we have, you can simply make this the largest number. We now perform an eigenvalue decomposition. A SVD to be exact, since our matrix is not square.

S  = linalg.svd(A)

We want to keep only the 3x3 portion of this answer, since it will represent the flow of the pools. In fact we only want the absolute values of this matrix; we only care if there is a flow in this cluster space.

M  = abs(S[2])

We can think of this matrix M as a Markov matrix and make it explicit by row normalizing. Once we have this we compute the (left) eigenvalue decomp. of this matrix.

M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)
V = abs(V)

Now a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is a eigenvalue of unity. The eigenvectors associated with these unity values are the ones we want:

idx = where(U > .999)[0]
C = V.T[idx] > 0

I have to use .999 due to the aforementioned numerical instability. At this point, we are done! Each independent cluster can now pull the corresponding rows out:

for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]

Which gives, as intended:

[0 1 3]
[2]

Change X to your lst and you'll get: [ 0 1 3 4 5 10 11 16] [2 8].


Addendum

Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 80% of the time - how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value, the further away from unity it is, the looser the association.


Visual Representation

Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and your lst respectively. Notice how in V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there were only two unique lists!

My Example Your sample data


Faster Implementation

In hindsight, I realized that you can skip the SVD step and compute only a single decomp:

M = dot(A.T,A)
M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)

The advantage with this method (besides speed) is that M is now symmetric, hence the computation can be faster and more accurate (no imaginary values to worry about).

查看更多
素衣白纱
4楼-- · 2019-01-01 16:25
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1,len(sub_list)):
        g.add_edge(sub_list[0],sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Performance:

5000 lists, 5 classes, average size 74, max size 1000
15.2264976415

Performance of merge1:

print timeit.timeit("merge1(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571

So it is 11x slower than the fastest.. but the code is much more simple and readable!

查看更多
登录 后发表回答