I have a dictionary whose values are lists of text values (ID: [text values]). Below is an excerpt.
data_dictionary = {
    52384: ['text2015', 'webnet'],
    18720: ['datascience', 'bigdata', 'links'],
    82465: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
    73120: ['links', 'scientometrics'],
    22276: ['text2015', 'webnet'],
    97376: ['text2015', 'webnet'],
    43424: ['biological', 'biomedics', 'datamining', 'datamodel', 'semantics'],
    23297: ['links', 'scientometrics'],
    45233: ['webnet', 'conference', 'links'],
}
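I created a defaultdict that maps each unique list of text values to the list of IDs that have exactly that list.
from collections import defaultdict

dd = defaultdict(list)
for k, v in data_dictionary.items():
    # Lists are unhashable, so a tuple of the text values serves as the key.
    dd[tuple(v)].append(k)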
This gave each unique tuple of text values together with the list of IDs that share it:
{('text2015', 'webnet'): [52384, 22276, 97376], ('datascience', 'bigdata', 'links'): [18720], ('biological', 'biomedics', 'datamining', 'datamodel', 'semantics'): [82465, 43424], ('links', 'scientometrics'): [73120, 23297]}
Each ID has a sum, which I look up in sum_dictionary.
def extract_sum(key_id, sum_dictionary):
    # Direct lookup instead of scanning every item;
    # returns None when the ID has no recorded sum.
    return sum_dictionary.get(key_id)
The sum dictionary looks like this:
sum_dict = {
    52384: 1444856137000, 18720: 1444859841000, 82465: 1444856,
    22276: 1674856137000, 97376: 1812856137000, 43424: 5183856,
    23297: 1614481000, 45233: 1276781300,
}
I want to output every pair of IDs that shares one or more text values, including pairs where one ID has more or fewer text values than the other. Each pair should appear on its own line in the form:
ID_1 ; ID_2 ; Sum_for_ID_1 ; Sum_for_ID_2 ; [one or more shared text values between ID_1 and ID_2]
where Sum_for_ID_1 < Sum_for_ID_2. For example:
45233 ; 52384 ; 1276781300 ; 1444856137000 ; ['webnet']
52384 ; 97376 ; 1444856137000 ; 1812856137000 ; ['text2015', 'webnet']
18720 ; 18720 ; 1444859841000 ; 1444859841000 ; ['datascience', 'bigdata', 'links']
73120 ; 23297 ; 144481000 ; 1614481000 ; ['links', 'scientometrics']
I tried using itertools to generate all combinations of the words in the dictionary values, but iterating over them takes far too long.
I also thought about applying set operations across the keys to find the ones that share values. Any ideas would really help.
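To make the set idea concrete, here is a minimal sketch of one way it might work, not a tested solution for the full data. It builds an inverted index from each text value to the set of IDs containing it, then pairs up only the IDs that meet in the same index entry. The function name shared_value_pairs is made up for illustration, the sample is a four-ID subset of the data above, and skipping IDs that have no entry in the sum dictionary is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def shared_value_pairs(data, sums):
    """Yield (id_1, id_2, sum_1, sum_2, shared_values) with sum_1 <= sum_2."""
    # Inverted index: text value -> set of IDs whose list contains it.
    index = defaultdict(set)
    for key, values in data.items():
        for value in values:
            index[value].add(key)

    # Only IDs that appear under the same text value can share anything,
    # so pairs are generated per index entry rather than over all IDs.
    shared = defaultdict(list)
    for value, keys in index.items():
        for pair in combinations(sorted(keys), 2):
            shared[pair].append(value)

    for (a, b), values in shared.items():
        # Assumption: IDs without a recorded sum are skipped.
        if a not in sums or b not in sums:
            continue
        # Put the ID with the smaller sum first.
        if sums[a] > sums[b]:
            a, b = b, a
        yield a, b, sums[a], sums[b], sorted(values)


# Four-ID subset of the question's data, used only as a small sample.
data = {
    52384: ['text2015', 'webnet'],
    97376: ['text2015', 'webnet'],
    45233: ['webnet', 'conference', 'links'],
    18720: ['datascience', 'bigdata', 'links'],
}
sums = {52384: 1444856137000, 97376: 1812856137000,
        45233: 1276781300, 18720: 1444859841000}

rows = list(shared_value_pairs(data, sums))
for row in rows:
    print(' ; '.join(str(field) for field in row))
```

This avoids enumerating word combinations, but it is still quadratic in the number of IDs that share any single text value, so a very common value could remain expensive.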