Question:
I am working on two large data sets, and my question is as follows.
Suppose I have two lists:
list1 = [A,B,C,D]
list2 = [B,D,A,G]
How can I efficiently find the matching indices, using Python, other than O(n^2) searching? The result should look like:
matching_index(list1,list2) -> [(0,2),(1,0),(3,1)]
Answer 1:
Without duplicates
If your objects are hashable and your lists have no duplicates, you can create an inverted index of the first list and then traverse the second list. This traverses each list only once and thus is O(n).
def find_matching_index(list1, list2):
    # Map each element of list1 to its index
    inverse_index = {element: index for index, element in enumerate(list1)}
    # Emit (list1_index, list2_index) pairs, matching the order in the OP's expected output
    return [(inverse_index[element], index)
            for index, element in enumerate(list2) if element in inverse_index]

find_matching_index([1, 2, 3], [3, 2, 1])  # [(2, 0), (1, 1), (0, 2)]
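On the OP's data this produces the same pairs as the expected output, with the unmatched 'G' simply skipped (the output order follows list2):

find_matching_index(['A', 'B', 'C', 'D'], ['B', 'D', 'A', 'G'])  # [(1, 0), (3, 1), (0, 2)]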
With duplicates
You can extend the previous solution to account for duplicates. You can keep track of multiple indices with a set.
def find_matching_index(list1, list2):
    # Create an inverse index whose values are now sets of indices
    inverse_index = {}
    for index, element in enumerate(list1):
        if element not in inverse_index:
            inverse_index[element] = {index}
        else:
            inverse_index[element].add(index)

    # Traverse the second list
    matching_index = []
    for index, element in enumerate(list2):
        # We have to create one pair per element in the set of the inverse index
        if element in inverse_index:
            matching_index.extend([(x, index) for x in inverse_index[element]])

    return matching_index

find_matching_index([1, 1, 2], [2, 2, 1])  # [(2, 0), (2, 1), (0, 2), (1, 2)]
Unfortunately, this is no longer O(n). Consider the case where you input [1, 1] and [1, 1]: the output is [(0, 0), (0, 1), (1, 0), (1, 1)]. Thus, by the size of the output alone, the worst case cannot be better than O(n^2). This solution is still O(n) if there are no duplicates, though.
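As a quick sanity check of that quadratic lower bound, reusing the duplicate-aware function above (the exact pair order may vary since sets are unordered):

n = 100
pairs = find_matching_index([1] * n, [1] * n)
print(len(pairs))  # 10000, i.e. n * n pairs, so no algorithm can beat O(n^2) here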
Non-hashable objects
Now comes the case where your objects are not hashable, but comparable. The idea here will be to sort your lists in a way that preserves the original index of each element. Then we can group sequences of elements that are equal to get matching indices.
Since we make heavy use of groupby and product in the following code, I made find_matching_index return a generator for memory efficiency on long lists.
from itertools import groupby, product

def find_matching_index(list1, list2):
    sorted_list1 = sorted((element, index) for index, element in enumerate(list1))
    sorted_list2 = sorted((element, index) for index, element in enumerate(list2))

    list1_groups = groupby(sorted_list1, key=lambda pair: pair[0])
    list2_groups = groupby(sorted_list2, key=lambda pair: pair[0])

    current2 = None
    for element1, group1 in list1_groups:
        try:
            # Advance list2 until its current group key reaches element1
            while current2 is None or current2[0] < element1:
                current2 = next(list2_groups)
        except StopIteration:
            break

        element2, group2 = current2
        if element2 > element1:
            continue  # keep this group around for the next element1

        indices_product = product((i for _, i in group1), (i for _, i in group2))
        yield from indices_product
        # In versions prior to 3.3, the line above must be
        # for x in indices_product:
        #     yield x
        current2 = None  # this group is consumed; fetch a fresh one next time
list1 = [[], [1, 2], []]
list2 = [[1, 2], []]
list(find_matching_index(list1, list2)) # [(0, 1), (2, 1), (1, 0)]
It turns out that time complexity does not suffer that much. Sorting of course takes O(n log(n)), but then groupby provides generators that can recover all elements by traversing our lists only twice. The conclusion is that our complexity is primarily bound by the size of the output of product, giving a best case where the algorithm is O(n log(n)) and a worst case that is once again O(n^2).
Answer 2:
If your objects are not hashable, but still orderable, you might want to consider using sorted to match both lists.
Assuming all elements in both lists have a match
You can sort the lists' indexes and pair the results:
indexes1 = sorted(range(len(list1)), key=lambda x: list1[x])
indexes2 = sorted(range(len(list2)), key=lambda x: list2[x])
matches = zip(indexes1, indexes2)
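For example, with two permutations of the same four elements (so the all-match assumption holds), a run might look like this:

list1 = ['B', 'D', 'A', 'C']
list2 = ['C', 'A', 'B', 'D']
indexes1 = sorted(range(len(list1)), key=lambda x: list1[x])  # [2, 0, 3, 1]
indexes2 = sorted(range(len(list2)), key=lambda x: list2[x])  # [1, 2, 0, 3]
print(list(zip(indexes1, indexes2)))  # [(2, 1), (0, 2), (3, 0), (1, 3)]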
If not all elements match, but there are no duplicates within each list
You can sort both at the same time and keep the indexes while sorting. Then, if you catch any consecutive duplicates, you know they are from different lists:
biglist = list(enumerate(list1)) + list(enumerate(list2))
biglist.sort(key=lambda x: x[1])
matches = [(biglist[i][0], biglist[i + 1][0])
           for i in range(len(biglist) - 1)
           if biglist[i][1] == biglist[i + 1][1]]
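Applied to the OP's lists, this reproduces the expected pairs; sort stability keeps each list1 entry ahead of its list2 duplicate, so each pair comes out as (list1 index, list2 index):

list1 = ['A', 'B', 'C', 'D']
list2 = ['B', 'D', 'A', 'G']
biglist = list(enumerate(list1)) + list(enumerate(list2))
biglist.sort(key=lambda x: x[1])
# biglist is now [(0, 'A'), (2, 'A'), (1, 'B'), (0, 'B'), (2, 'C'), ...]
matches = [(biglist[i][0], biglist[i + 1][0])
           for i in range(len(biglist) - 1)
           if biglist[i][1] == biglist[i + 1][1]]
print(matches)  # [(0, 2), (1, 0), (3, 1)]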
Answer 3:
One brute-force answer to this problem, if for no other reason than to validate any solution, is given by:
[(xi, xp) for (xi, x) in enumerate(list1) for (xp, y) in enumerate(list2) if x==y]
How you will have to optimize this depends in large part on data volumes and memory capacity, so some idea of how large these lists are might be helpful. I'd imagine the method I discuss below would be good for lists with millions of values at least.
Since dictionary access is O(1), it would seem worth trying to map the elements in the second list to their positions. Assuming the same element can be repeated, a collections.defaultdict will easily allow us to construct the necessary dict.
from collections import defaultdict

l2_pos = defaultdict(list)
for (p, k) in enumerate(list2):
    l2_pos[k].append(p)
The expression l2_pos[k] is now a list of the positions in list2 at which element k occurs. It only remains to pair each of these with the positions of the corresponding keys in list1. The result in list form is
[(p1, p2) for (p1, k) in enumerate(list1) for p2 in l2_pos[k]]
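Putting the two pieces together on the OP's data reproduces the expected output (note that looking up a missing key such as 'C' silently inserts an empty list into the defaultdict, which is harmless here):

from collections import defaultdict

list1 = ['A', 'B', 'C', 'D']
list2 = ['B', 'D', 'A', 'G']

l2_pos = defaultdict(list)
for (p, k) in enumerate(list2):
    l2_pos[k].append(p)

print([(p1, p2) for (p1, k) in enumerate(list1) for p2 in l2_pos[k]])
# [(0, 2), (1, 0), (3, 1)]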
If these structures are large, however, you might be better served by a generator expression. To bind a name to the expression inside the list comprehension above you would write
values = ((p1, p2) for (p1, k) in enumerate(list1) for p2 in l2_pos[k])
If you then iterate over values, you avoid the overhead of creating a list containing all the values, thereby reducing the load on Python's memory management and garbage collection, which is pretty much all overhead as far as solving your problem is concerned.
When you start to deal with large data volumes, understanding generators can mean the difference between having enough memory to solve your problem or not. In many cases they have a clear advantage over list comprehensions.
EDIT: This technique can be further accelerated by using sets rather than lists to hold the positions, unless the changes in ordering would be harmful. This change is left as an exercise for the reader.
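If you do want to try that variant, a minimal sketch might look like the following, assuming the order of positions within a value does not matter (whether it actually helps depends on your data):

from collections import defaultdict

list1 = ['A', 'B', 'C', 'D']
list2 = ['B', 'D', 'A', 'G']

l2_pos = defaultdict(set)
for (p, k) in enumerate(list2):
    l2_pos[k].add(p)  # a set of positions instead of a list

values = ((p1, p2) for (p1, k) in enumerate(list1) for p2 in l2_pos[k])
print(sorted(values))  # [(0, 2), (1, 0), (3, 1)]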
Answer 4:
Using a dict reduces lookup time, and the collections.defaultdict specialization can help with the bookkeeping. The goal is a dict whose values are the indexing pairs you are after. Duplicate values overwrite earlier ones in the list.
import collections

# Make a test list
list1 = list('ABCDEFGHIJKLMNOP')
list2 = list1[len(list1) // 2:] + list1[:len(list1) // 2]

# Map list items to positions as in: [list1_index, list2_index]
# by creating a defaultdict that fills in items not in list1,
# then adding list1 items and updating with list2 items.
list_indexer = collections.defaultdict(lambda: [None, None],
                                       ((item, [i, None]) for i, item in enumerate(list1)))
for i, val in enumerate(list2):
    list_indexer[val][1] = i

print(list(list_indexer.values()))
# [[0, 8], [1, 9], ..., [7, 15], [8, 0], ..., [15, 7]] on Python 3.7+ (insertion-ordered dicts)
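Applied to the OP's lists, unmatched items carry None placeholders, which you can filter out to recover the expected pairs:

import collections

list1 = list('ABCD')
list2 = list('BDAG')

list_indexer = collections.defaultdict(lambda: [None, None],
                                       ((item, [i, None]) for i, item in enumerate(list1)))
for i, val in enumerate(list2):
    list_indexer[val][1] = i

# Keep only fully matched pairs, as in the OP's expected output
print([tuple(v) for v in list_indexer.values() if None not in v])
# [(0, 2), (1, 0), (3, 1)]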
Answer 5:
Here is a simple approach with a defaultdict.
Given
import collections as ct
lst1 = list("ABCD")
lst2 = list("BDAG")
lst3 = list("EAB")
str1 = "ABCD"
Code
def find_matching_indices(*iterables, pred=None):
    """Return a list of matched indices across `m` iterables."""
    if pred is None:
        pred = lambda x: x[0]

    # Dict insertion
    dd = ct.defaultdict(list)
    for lst in iterables:                                # O(m)
        for i, x in enumerate(lst):                      # O(n)
            dd[x].append(i)                              # O(1)

    # Filter + sort
    vals = (x for x in dd.values() if len(x) > 1)        # O(n)
    return sorted(vals, key=pred)                        # O(n log n)
Demo
Find matches in two lists (per OP):
find_matching_indices(lst1, lst2)
# [[0, 2], [1, 0], [3, 1]]
Sort by a different resulting index:
find_matching_indices(lst1, lst2, pred=lambda x: x[1])
# [[1, 0], [3, 1], [0, 2]]
Match items in more than two iterables (of optionally variable length):
find_matching_indices(lst1, lst2, lst3, str1)
# [[0, 2, 1, 0], [1, 0, 2, 1], [2, 2], [3, 1, 3]]
Details
Dictionary insertion
Each item is appended to the lists of the defaultdict. The result looks something like this, which is later filtered:
defaultdict(list, {'A': [0, 2], 'B': [1, 0], 'C': [2], 'D': [3, 1], 'G': [3]})
At first glance, from the double for loops, one might be tempted to say the time complexity is O(n²). However, the outer loop runs over the m input iterables and the inner loop over the n elements of each, so every element is visited exactly once with O(1) dict operations. The insertion phase is therefore O(n*m), which is below O(n²) whenever m is small relative to n.
Filtering
Non-matches (lists of length 1) are filtered out, and results are sorted (primarily for disordered dicts in Python < 3.6).
Using the timsort algorithm via sorted to sort the dict values (lists) by some index, the worst case is O(n log n). Since dict insertion order is preserved in Python 3.6+, the values come out already nearly sorted, which reduces the sorting cost to roughly O(n).
Overall, the best case time complexity is O(n); the worst case is O(n log n) when using sorted in Python < 3.6, otherwise it is O(n*m).