How do I check if there are duplicates in a flat list?

Posted 2019-01-04 19:17

For example, given the list ['one', 'two', 'one'], the algorithm should return True, whereas given ['one', 'two', 'three'] it should return False.

7 answers
别忘想泡老子
Answer #2 · 2019-01-04 19:25

Use set() to remove duplicates if all values are hashable:

>>> your_list = ['one', 'two', 'one']
>>> len(your_list) != len(set(your_list))
True
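
A minimal reusable wrapper around this check (the helper name is illustrative; it assumes every element is hashable):

def has_duplicates(seq):
    # a duplicate shrinks the set, so the lengths differ
    return len(seq) != len(set(seq))

>>> has_duplicates(['one', 'two', 'one'])
True
>>> has_duplicates(['one', 'two', 'three'])
False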
戒情不戒烟
Answer #3 · 2019-01-04 19:33

If you are fond of functional programming style, here is a useful function that is self-documenting and tested via doctest.

from functools import reduce

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1, 2, 3, 5, 3, 2, 6])
    ({1, 2, 3, 5, 6}, {2, 3})
    """
    return reduce(
        lambda acc, o: (acc[0] | {o}, acc[1] | (acc[0] & {o})),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

From there you can test uniqueness by checking whether the second member of the returned pair is empty:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

Note that this is not efficient, since it explicitly constructs the full decomposition. But staying with reduce, you can come up with something equivalent (though slightly less efficient) to answer 5:

from functools import reduce

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise ValueError("duplicate found")
            return s | {o}
        reduce(func, l, set())
        return True
    except ValueError:
        return False
一夜七次
Answer #4 · 2019-01-04 19:39

This is old, but the answers here led me to a slightly different solution. If you are up for abusing comprehensions, you can get short-circuiting this way.

xs = [1, 2, 1]
s = set()
# set.add() returns None, so the expression is truthy only when x was already
# seen; any() short-circuits at the first duplicate.
any(x in s or s.add(x) for x in xs)

# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))
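
Applied to the lists from the question (a quick illustration, not part of the original answer):

s = set()
print(any(x in s or s.add(x) for x in ['one', 'two', 'one']))    # True

s = set()
print(any(x in s or s.add(x) for x in ['one', 'two', 'three']))  # False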
做自己的国王
Answer #5 · 2019-01-04 19:44

I recently answered a related question about finding all the duplicates in a list, using a generator. It has the advantage that, if you only need to establish whether there is a duplicate, you just request the first item and can ignore the rest, which is the ultimate shortcut.

This is an interesting set-based approach I adapted straight from moooeeeep:

def getDupes(l):
    seen = set()
    seen_add = seen.add           # cache the bound method to avoid repeated attribute lookups
    for x in l:
        if x in seen or seen_add(x):
            yield x

Accordingly, a full list of dupes would be list(getDupes(etc)). To simply test "if" there is a dupe, it should be wrapped as follows:

def hasDupes(l):
    try:
        next(getDupes(l))         # any yielded item means a dupe was found
        return True
    except StopIteration:
        pass
    return False
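
Equivalently, next() with a default sentinel avoids the try/except entirely; this is a small variation, not from the original answer:

_MISSING = object()   # unique sentinel; the name is illustrative

def hasDupes(l):
    return next(getDupes(l), _MISSING) is not _MISSING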

This scales well and provides consistent running times wherever the dupe sits in the list -- I tested with lists of up to 1m entries. If you know something about the data, specifically that dupes are likely to show up in the first half, or other things that let you skew the requirements, such as needing the actual dupes, then there are a couple of alternative dupe locators that might outperform it. The two I recommend are:

A simple dict-based approach, very readable:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:              # seen exactly once before: report it ...
                yield i
                d[i] = False      # ... and mark it so it is reported only once
        else:
            d[i] = True           # first sighting

Leverage itertools (essentially tee plus filter/zip, originally written with ifilter/izip) on the sorted list; very efficient if you are collecting all the dupes, though not as quick at finding just the first:

import itertools

def getDupes(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)                               # advance b by one so the pairs are adjacent elements
    r = None
    for k, g in filter(lambda x: x[0] == x[1], zip(a, b)):
        if k != r:
            yield k
            r = k
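
A quick sanity check (illustrative, not from the original answer); note that the dict and sorted/tee generators report each duplicate only once, whereas the first set-based generator yields a value again for every extra occurrence:

sample = [1, 2, 1, 3, 3, 3, 4]
print(list(getDupes(sample)))   # [1, 3] with either the dict or the sorted/tee version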

These were the top performers among the approaches I tried for getting the full dupe list, with the first dupe occurring anywhere in a 1m-element list from the start to the middle. It was surprising how little overhead the sort step added. Your mileage may vary, but here are my specific timed results:

Finding the FIRST duplicate, with a single dupe placed "n" elements into a 1m-element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157
仙女界的扛把子
Answer #6 · 2019-01-04 19:44

I found this to give the best performance because it short-circuits as soon as the first duplicate is found; the algorithm has O(n) time and space complexity, where n is the list's length:

def has_duplicated_elements(iterable):
    """ Given an `iterable`, return True if there are duplicated entries. """
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add   # cache the bound method

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add(possible_duplicate_element)

    return False
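
A quick check against the examples from the question (illustrative):

print(has_duplicated_elements(['one', 'two', 'one']))    # True
print(has_duplicated_elements(['one', 'two', 'three']))  # False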
老娘就宠你
Answer #7 · 2019-01-04 19:45

Another way of doing this succinctly is with Counter.

To just determine if there are any duplicates in the original list:

from collections import Counter

def has_dupes(l):
    # most_common(1) returns [(element, count)] for the most frequent element,
    # or an empty list when l is empty
    top = Counter(l).most_common(1)
    return bool(top) and top[0][1] > 1

Or to get a list of items that have duplicates:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]
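
Applied to the lists from the question (illustrative):

>>> has_dupes(['one', 'two', 'one'])
True
>>> get_dupes(['one', 'two', 'one'])
['one']
>>> has_dupes(['one', 'two', 'three'])
False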