Removing Duplicates From Dictionary

2020-01-24 05:24发布

I have the following Python 2.7 dictionary data structure (I do not control source data - comes from another system as is):

{112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   },
 112765083670: ...
}

The dictionary keys will always be unique. Dst, src, and alias can be duplicates. All records will always have a dst and src but not every record will necessarily have an alias as seen in the third record.

In the sample data either of the first two records would be removed (doesn't matter to me which one). The third record would be considered unique since although dst and src are the same it is missing alias.

My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key.

How does this rookie accomplish this?

Also, my limited understanding of Python interprets the data structure as a dictionary with the values stored in dictionaries... a dict of dicts, is this correct?

10条回答
再贱就再见
2楼-- · 2020-01-24 05:49
dups={}

for key,val in dct.iteritems():
    if val.get('alias') != None:
        ref = "%s%s%s" % (val['dst'] , val['src'] ,val['alias'])# a simple hash
        dups.setdefault(ref,[]) 
        dups[ref].append(key)

for k,v in dups.iteritems():
    if len(v) > 1:
        for key in v:
            del dct[key]
查看更多
The star\"
3楼-- · 2020-01-24 05:55

One simple approach would be to create a reverse dictionary using the concatenation of the string data in each inner dictionary as a key. So say you have the above data in a dictionary, d:

>>> import collections
>>> reverse_d = collections.defaultdict(list)
>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str].append(key)
... 
>>> duplicates = [keys for key_str, keys in reverse_d.iteritems() if len(keys) > 1]
>>> duplicates
[[112762853385, 112762853378]]

If you don't want a list of duplicates or anything like that, but just want to create a duplicate-less dict, you could just use a regular dictionary instead of a defaultdict and re-reverse it like so:

>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str] = key
>>> new_d = dict((val, d[val]) for val in reverse_d.itervalues())
查看更多
够拽才男人
4楼-- · 2020-01-24 05:56

You can use

set(dictionary) 

to solve your problem.

查看更多
神经病院院长
5楼-- · 2020-01-24 06:00
input_raw = {112762853378:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112762853385:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112760496444:  {'dst': ['10.121.4.299'],
                             'src': ['1.2.3.4']    },
             112760496502:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112758601487:  {'src': ['1.2.3.4'],
                             'alias': ['www.example.com'],
                             'dst': ['10.121.4.136']},
             112757412898:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112757354733:  {'dst': ['124.12.13.14'],
                             'src': ['8.5.6.0']},             
             }

for x in input_raw.iteritems():
    print x
print '\n---------------------------\n'

seen = []

for k,val in input_raw.items():
    if val in seen:
        del input_raw[k]
    else:
        seen.append(val)


for x in input_raw.iteritems():
    print x

result

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112758601487L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496502L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})
(112762853378L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})

---------------------------

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})

The facts that this solution creates first a list input_raw.iteritems() (as in Andrew's Cox's answer) and requires a growing list seen are drawbacks.
But the first can't be avoided (using iteritems() doesn't work) and the second is less heavy than re-creating a list result.values() from growing list result for each turn of a loop.

查看更多
混吃等死
6楼-- · 2020-01-24 06:00

Another reverse dict variation:

>>> import pprint
>>> 
>>> data = {
...   112762853378: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112762853385: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112760496444: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4']
...    },
...  112760496502: 
...    {'dst': ['10.122.195.34'], 
...     'src': ['4.3.2.1']
...    },
... }
>>> 
>>> keep = set({repr(sorted(value.items())):key
...             for key,value in data.iteritems()}.values())
>>> 
>>> for key in data.keys():
...     if key not in keep:
...         del data[key]
... 
>>> 
>>> pprint.pprint(data)
{112760496444L: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
 112760496502L: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
 112762853378L: {'alias': ['www.example.com'],
                 'dst': ['10.121.4.136'],
                 'src': ['1.2.3.4']}}
查看更多
乱世女痞
7楼-- · 2020-01-24 06:01

You could go though each of the items (the key value pair) in the dictionary and add them into a result dictionary if the value was not already in the result dictionary.

input_raw = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   }
}

result = {}

for key,value in input_raw.items():
    if value not in result.values():
        result[key] = value

print result
查看更多
登录 后发表回答