Python: simple list merging based on intersections

2019-01-01 15:59发布

Consider there are some lists of integers as:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is to merge lists having at least one common element. So the results only for the given part will be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (elements are just numbers)? Is tree structure something to think about? I do the job now by converting lists to sets and iterating for intersections, but it is slow! Furthermore I have a feeling that is so-elementary! In addition, the implementation lacks something (unknown) because some lists remain unmerged sometime! Having said that, if you were proposing self-implementation please be generous and provide a simple sample code [apparently Python is my favoriate :)] or pesudo-code.
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]];
#--------------------------------------

The function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
>>> [0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: To my experience the code given by Niklas Baumstark below showed to be a bit faster for the simple cases. Not tested the method given by "Hooked" yet, since it is completely different approach (by the way it seems interesting). The testing procedure for all of these could be really hard or impossible to be ensured of the results. The real data set I will use is so large and complex, so it is impossible to trace any error just by repeating. That is I need to be 100% satisfied of the reliability of the method before pushing it in its place within a large code as a module. Simply for now Niklas's method is faster and the answer for simple sets is correct of course.
However how can I be sure that it works well for real large data set? Since I will not be able to trace the errors visually!

Update 3: Note that reliability of the method is much more important than speed for this problem. I will be hopefully able to translate the Python code to Fortran for the maximum performance finally.

Update 4:
There are many interesting points in this post and generously given answers, constructive comments. I would recommend reading all thoroughly. Please accept my appreciation for the development of the question, amazing answers and constructive comments and discussion.

15条回答
十年一品温如言
2楼-- · 2019-01-01 16:07

Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint at merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work although it might not be much faster.

def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()

This approach is comparable to the other best algorithms on Rik's benchmarks.

查看更多
几人难应
3楼-- · 2019-01-01 16:08

My solution, works well on small lists and is quite readable without dependencies.

def merge_list(starting_list):
    final_list = []
    for i,v in enumerate(starting_list[:-1]):
        if set(v)&set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list

Benchmarking it:

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

%timeit merge_list(lists)

100000 loops, best of 3: 4.9 µs per loop

查看更多
明月照影归
4楼-- · 2019-01-01 16:12

This would be my updated approach:

def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i,res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0,i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results

Note: During the merging empty lists will be removed.

Update: Reliability.

You need two tests for a 100% reliabilty of success:

  • Check that all the resulting sets are mutually disjointed:

    merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]
    
    from itertools import combinations
    for a,b in combinations(merged,2):
        if not a.isdisjoint(b):
            raise Exception(a,b)    # just an example
    
  • Check that the merged set cover the original data. (as suggested by katrielalex)

I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.

查看更多
姐姐魅力值爆表
5楼-- · 2019-01-01 16:13

My attempt:

def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = True
                    common |= x
            results.append(common)
        sets = results
    return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)

Benchmark:

import random

# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to true to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [
            range(class_size * i, class_size * (i + 1)) for i in range(class_count)
        ]
        for c in classes:
            # distribute each class across ~300 lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results

# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)

# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)

These timings are obviously dependent on the specific parameters to the benchmark, like number of classes, number of lists, list size, etc. Adapt those parameters to your need to get more helpful results.

Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strength and weaknesses, depending on the kind of input they get:

=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878
查看更多
刘海飞了
6楼-- · 2019-01-01 16:15

I tried to summurize everything that's been said and done about this topic in this question and in the duplicate one.

I tried to test and time every solution (all the code here).

Testing

This is the TestCase from the testing module:

class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a,b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):    # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s

        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)

        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):      # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))

This test is supposing a list of sets as result, so I couldn't test a couple of sulutions that worked with lists.

I couldn't test the following:

katrielalex
steabert

Among the ones I could test, two failed:

  -- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL

  -- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL

Timing

The performances are strongly related with the data test employed.

So far three answers tried to time theirs and others solution. Since they used different testing data they had different results.

  1. Niklas benchmark is very twakable. With his banchmark one could do different tests changing some parameters.

    I've used the same three sets of parameters he used in his own answer, and I put them in three different files:

    filename = './lists/timing_1.txt'
    class_count = 50,
    class_size = 1000,
    list_count_per_class = 100,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,
    
    filename = './lists/timing_2.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,
    
    filename = './lists/timing_3.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.1,
    

    This are the results that I got:

    From file: timing_1.txt

    Timing with: >> Niklas << Benchmark
    Info: 5000 lists, average size 305, max size 999
    
    Timing Results:
    10.434  -- alexis
    11.476  -- agf
    11.555  -- Niklas B.
    13.622  -- Rik. Poggi
    14.016  -- agf (optimized)
    14.057  -- ChessMaster
    20.208  -- katrielalex
    21.697  -- steabert
    25.101  -- robert king
    76.870  -- Sven Marnach
    133.399  -- hochl
    

    From file: timing_2.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 305, max size 999
    
    Timing Results:
    8.247  -- Niklas B.
    8.286  -- agf
    8.637  -- Rik. Poggi
    8.967  -- alexis
    9.090  -- ChessMaster
    9.091  -- agf (optimized)
    18.186  -- katrielalex
    19.543  -- steabert
    22.852  -- robert king
    70.486  -- Sven Marnach
    104.405  -- hochl
    

    From file: timing_3.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 98, max size 999
    
    Timing Results:
    2.746  -- agf
    2.850  -- Niklas B.
    2.887  -- Rik. Poggi
    2.972  -- alexis
    3.077  -- ChessMaster
    3.174  -- agf (optimized)
    5.811  -- katrielalex
    7.208  -- robert king
    9.193  -- steabert
    23.536  -- Sven Marnach
    37.436  -- hochl
    
  2. With Sven's testing data I got the following results:

    Timing with: >> Sven << Benchmark
    Info: 200 lists, average size 10, max size 10
    
    Timing Results:
    2.053  -- alexis
    2.199  -- ChessMaster
    2.410  -- agf (optimized)
    3.394  -- agf
    3.398  -- Rik. Poggi
    3.640  -- robert king
    3.719  -- steabert
    3.776  -- Niklas B.
    3.888  -- hochl
    4.610  -- Sven Marnach
    5.018  -- katrielalex
    
  3. And finally with Agf's benchmark I got:

    Timing with: >> Agf << Benchmark
    Info: 2000 lists, average size 246, max size 500
    
    Timing Results:
    3.446  -- Rik. Poggi
    3.500  -- ChessMaster
    3.520  -- agf (optimized)
    3.527  -- Niklas B.
    3.527  -- agf
    3.902  -- hochl
    5.080  -- alexis
    15.997  -- steabert
    16.422  -- katrielalex
    18.317  -- robert king
    1257.152  -- Sven Marnach
    

As I said at the beginning all the code is available at this git repository. All the merging functions are in a file called core.py, every function there with its name ending with _merge will be auto loaded during the tests, so it shouldn't be hard to add/test/improve your own solution.

Let me also know if there's something wrong, it's been a lot of coding and I could use a couple of fresh eyes :)

查看更多
梦寄多情
7楼-- · 2019-01-01 16:16

This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear time operations as possible, like list slicing and deletion from early in the list.

from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

The less overlap between the sets in a given set of data, the better this will do compared to the other functions.

Here is an example case. If you have 4 sets, you need to compare:

1, 2
1, 3
1, 4
2, 3
2, 4
3, 4

If 1 overlaps with 3, then 2 needs to be re-tested to see if it now overlaps with 1, in order to safely skip testing 2 against 3.

There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue with the testing by comparing 1 with 4, then going back and re-testing. The latter results in fewer disjointness tests, as more merges happen in a single pass, so on the re-test pass, there are fewer sets left to test against.

The problem is to track which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.

The disjoint counter allows this to be tracked.


My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.

According to my testing (or the test in the accepted answer), it's slightly (up to 10%) faster than the next fastest solution.

def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

No need for the un-Pythonic counters (i, range) or complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.

My (faster and simpler) version of the test code:

import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
  newsets, sets = [set(lst) for lst in lists if lst], []
  while len(sets) != len(newsets):
    sets, newsets = newsets, []
    for aset in sets:
      for eachset in newsets:
        if not aset.isdisjoint(eachset):
          eachset.update(aset)
          break
      else:
        newsets.append(aset)
  return newsets

def merge1(lsts):
  sets = [set(lst) for lst in lsts if lst]
  merged = 1
  while merged:
    merged = 0
    results = []
    while sets:
      common, rest = sets[0], sets[1:]
      sets = []
      for x in rest:
        if x.isdisjoint(common):
          sets.append(x)
        else:
          merged = 1
          common |= x
      results.append(common)
    sets = results
  return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
查看更多
登录 后发表回答