This question is the culmination of two pieces of code guided by two answers here on SO. The first question I had was how to compare similarity between two strings and I got a good answer as seen here with the following code:
code 1
def get_bigrams(string):
'''
Takes a string and returns a list of bigrams
'''
s = string.lower()
return { s[i:i+2] for i in range(len(s) - 1) }
def string_similarity(str1, str2):
'''
Perform bigram comparison between two strings
and return a percentage match in decimal form
'''
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
intersection = set(pairs1) & set(pairs2)
return (2.0 * len(intersection)) / (len(pairs1) + len(pairs2))
After that I needed a way to sort the list of names for me to run them through the above code. I got the code here as seen below:
code 2
import itertools
persons = ["Peter parker", "Richard Parker", "Parker Richard", "Aunt May"]
similarity = []
for p1, p2 in itertools.combinations(persons, 2):
similarity.append(string_similarity(p1,p2))
print("%s - %s: " %(p1, p2) + " " + str(string_similarity(p1, p2)))
similarity = sorted(similarity, key=float)
print(similarity)
Now, the final hurdle is that my data is not in a list and is actually fetched from a database with primary keys which is what I ultimately want to track. Meaning when I compare multiple names, I need to mark that e.g. ID 1 and ID 2 are the most variant. For me to determine that those two IDs are the most variant, I need to sort the result of 'code1` above which looks like below:
Peter parker - Richard Parker: 0.5454545454545454
Peter parker - Parker Richard: 0.5454545454545454
Peter parker - Aunt May: 0.0
Richard Parker - Parker Richard: 0.8333333333333334
Richard Parker - Aunt May: 0.0
Parker Richard - Aunt May: 0.0
[0.0, 0.0, 0.0, 0.5454545454545454, 0.5454545454545454, 0.8333333333333334]
In my head instead of those names there I need the Primary IDs with which the names were fetched with so am thinking using a dictionary. Is there a way to run a dictionary of {PID:Name}, {PID1:Name1}, PID2:Name2} using code2
, get the similarity value using code1
, sort the result and then know that names with the highest similarity are PID1 and PID3? Or is there a more elegant and less hair pulling way than am currently thinking...
Yes, you need to associate the pair (ID, name). For this you can use a dict, a tuple or even a class. For example using tuples your code 2 would change to:
You get: