I'm looking for a gem or project that would let me identify that two names refer to the same person. For example:
J.R. Smith == John R. Smith == John Smith == John Roy Smith == Johnny Smith
I think you get the idea. I know nothing is going to be 100% accurate but I'd like to get something that at least handles the majority of cases. I know that last one is probably going to need a database of nicknames.
I think one option would be to use a Ruby implementation of the Levenshtein distance:
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Then you could decide that names with a distance less than X (where X is a threshold you will have to tune) belong to the same person.
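To make the threshold idea concrete, here is a minimal pure-Ruby sketch of the distance defined above. The helper name same_person? and the default threshold of 3 are my own assumptions for illustration, not something from a particular gem.

```ruby
# Textbook dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index do |ca, i|
    curr = [i + 1]
    b.each_char.with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: names count as the same person when the distance
# is at most max_distance (the X you will have to tune).
def same_person?(name_a, name_b, max_distance: 3)
  levenshtein(name_a.downcase, name_b.downcase) <= max_distance
end

puts levenshtein('kitten', 'sitting')                # 3
puts same_person?('John Smith', 'Johnny Smith')      # true (distance 2)
puts same_person?('J.R. Smith', 'John Roy Smith')    # false (distance is at least 4)
```

The last call shows the main weakness: initials versus full names are far apart in edit distance, so a plain threshold will miss them.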
EDIT
After a little searching I found another algorithm, Metaphone, which is based on phonetics.
It still has a lot of holes in it, but I think the best anyone can do here is give you alternatives to test so you can see what works best.
This is a little late (and a shameless plug to boot), but for what it's worth, I wrote a human name parser during a GSoC project, which you can install with gem install namae. It obviously does not detect your duplicates reliably, but it helps with this kind of task.
For instance, you can parse the names in your example and use a display form using initials to detect names whose initials are identical, and so on and so forth:
names = Namae.parse('J.R. Smith and John R. Smith and John Smith and John Roy Smith and Johnny Smith ')
names.map { |n| [n.given, n.family] }
#=> [["J.R.", "Smith"], ["John R.", "Smith"], ["John", "Smith"], ["John Roy", "Smith"], ["Johnny", "Smith"]]
names.map { |n| n.initials expand: true }
#=> ["J.R. Smith", "J.R. Smith", "J. Smith", "J.R. Smith", "J. Smith"]
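As a rough sketch of the "identical initials" idea, you could bucket the original names by their initials form. To keep this runnable without the gem, the initials array below is copied from the output shown above. The variable names are just for illustration.

```ruby
originals = ['J.R. Smith', 'John R. Smith', 'John Smith', 'John Roy Smith', 'Johnny Smith']
# Copied from the n.initials(expand: true) output above.
initials  = ['J.R. Smith', 'J.R. Smith', 'J. Smith', 'J.R. Smith', 'J. Smith']

# Group each original name under its initials form.
groups = originals.zip(initials)
                  .group_by { |_name, key| key }
                  .transform_values { |pairs| pairs.map(&:first) }

groups.each { |key, names| puts "#{key}: #{names.inspect}" }
```

Names that land in the same bucket are only candidates for being duplicates. The two buckets here would still need a second pass (for example, treating a missing middle initial as compatible) to be merged.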
Something like:
1: Convert names to arrays:
irb> names.map!{|n|n.scan(/[^\s.]+\.?/)}
["J.", "R.", "Smith"]
["John", "R.", "Smith"]
["John", "Smith"]
["John", "Roy", "Smith"]
["Johnny", "Smith"]
2: Some function of identity:
for a,b in names.combination(2)
p [(a&b).size,a,b]
end
[2, ["J.", "R.", "Smith"], ["John", "R.", "Smith"]]
[1, ["J.", "R.", "Smith"], ["John", "Smith"]]
[1, ["J.", "R.", "Smith"], ["John", "Roy", "Smith"]]
[1, ["J.", "R.", "Smith"], ["Johnny", "Smith"]]
[2, ["John", "R.", "Smith"], ["John", "Smith"]]
[2, ["John", "R.", "Smith"], ["John", "Roy", "Smith"]]
[1, ["John", "R.", "Smith"], ["Johnny", "Smith"]]
[2, ["John", "Smith"], ["John", "Roy", "Smith"]]
[1, ["John", "Smith"], ["Johnny", "Smith"]]
[1, ["John", "Roy", "Smith"], ["Johnny", "Smith"]]
Or, instead of &, you may use .permutation + .zip + .max to apply a custom function that determines whether two parts of the names are identical.
UPD:
aim = 'Rob Bobbie Johnson'
candidates = [
"Bob Robbie John",
"Bobbie J. Roberto",
"R.J.B.",
]
$synonyms = Hash[ [
["bob",["bobbie"]],
["rob",["robbie","roberto"]],
] ]
def prepare name
name.scan(/[^\s.]+\.?/).map &:downcase
end
def mf a,b # magic function: scores how well the parts of a and b line up pairwise
a.zip(b).map do |i,j|
next 1 if i == j
next 0.9 if $synonyms[i].to_a.include?(j) || $synonyms[j].to_a.include?(i)
next 0.5 if i[/\.$/] && j.start_with?(i.chomp '.')
next 0.5 if j[/\.$/] && i.start_with?(j.chomp '.')
-10 # if some part of name appears to be different -
# it's bad even if another two parts were good
end.inject :+
end
for c in candidates
results = prepare(c).permutation.map do |per|
[mf(prepare(aim),per),per]
end
p [results.transpose.first.max,c]
end
[-8.2, "Bob Robbie John"] # 0.9 + 0.9 - 10 # Johnson != John # I think ..)
[2.4, "Bobbie J. Roberto"] # 1 + 0.9 + 0.5 # Rob == Roberto, Bobbie == Bobbie, Johnson ~~ J.
[1.5, "R.J.B."] # 0.5 + 0.5 + 0.5
For anyone who has to try to match human names from different data sources, this is a VERY hard problem to address. Using a combination of 3 gems seems to do pretty well.
We have an application where we have a million people in List A, and need to match them with dozens of different data sources. (And despite what some of the more pedantic comments claim, that is not a 'design flaw' that is the nature of dealing with 'real world' messy data.)
The only thing we have found to work reasonably well thus far is a combination of the namae gem (for parsing names into a standardized first, middle, last, suffix representation), the text gem to calculate Levenshtein, Soundex, Metaphone, and Porter scores, and also fuzzy-string-match, which calculates the JaroWinkler score (which is often the best of the lot).
- parse into a standard format separating last, first, middle, suffix using namae. We pre-process with a regex to extract nicknames when formatted as John "JJ" Doe or Samuel (Sammy) Smith
- calculate ALL scores on a sanitized version of the full name (all caps, remove punctuation, last name first) ... jarowinkler, soundex, levenshtein, metaphone, white, porter. (JaroWinkler and Soundex often do the best.)
- declare a match if N scores exceed individually set thresholds. (We use any 2 that pass as a pass)
- if no match, try again using only last name, first name, middle initial, with higher thresholds (i.e., stricter matching).
- if there is still no match, replace the first name with the nickname (if any) and try again.
With some tweaking of score thresholds for each scoring method, we get pretty good results. YMMV.
BTW putting last name first is very important, at least for JaroWinkler, since there is generally less variation in last names (Smithe is almost always Smithe, but first name might be Tom or Tommy or Thomas in different data sources), and the beginning of the string is most 'sensitive' in JaroWinkler. For "ROB SMITHE" / "ROBIN SMITHE", the JaroWinkler score is 0.91 if you do first name first, but 0.99 if you do last name first.
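To see why the ordering matters, here is a textbook Jaro-Winkler written in plain Ruby. This is a sketch of the standard algorithm, not the fuzzy-string-match implementation, so the absolute numbers it produces can differ from what the gem reports. The ordering effect is the same, though.

```ruby
# Jaro similarity: counts characters that match within a sliding window
# and penalizes matched characters that appear in a different order.
def jaro(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  window = [[s1.length, s2.length].max / 2 - 1, 0].max
  matched1 = Array.new(s1.length, false)
  matched2 = Array.new(s2.length, false)
  matches = 0

  s1.each_char.with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, s2.length - 1].min
    (lo..hi).each do |j|
      next if matched2[j] || s2[j] != ch
      matched1[i] = matched2[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  seq1 = s1.chars.select.with_index { |_, i| matched1[i] }
  seq2 = s2.chars.select.with_index { |_, j| matched2[j] }
  transpositions = seq1.zip(seq2).count { |a, b| a != b } / 2

  m = matches.to_f
  (m / s1.length + m / s2.length + (m - transpositions) / m) / 3
end

# Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
def jaro_winkler(s1, s2, scaling: 0.1)
  score = jaro(s1, s2)
  prefix = s1.chars.zip(s2.chars).take(4).take_while { |a, b| a == b }.length
  score + prefix * scaling * (1 - score)
end

first_name_first = jaro_winkler('ROB SMITHE', 'ROBIN SMITHE')
last_name_first  = jaro_winkler('SMITHE ROB', 'SMITHE ROBIN')
puts first_name_first < last_name_first   # true
```

With the last name first, the variable part of the name sits at the end of the string, so every character matches in order and the full four-character prefix bonus applies.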
The best pre-built option you will probably find for this is the gem simply called "text".
https://github.com/threedaymonk/text
It has a number of matching algorithms: Levenshtein Distance, Metaphone, Soundex, and more.
I don't think such a library exists.
I don't mean to offend, but this problem seems like it arises from poor design. Maybe if you post more details about the general problem you are trying to solve, people can suggest a better way.
Ruby has a very nice gem called text, and I've found Text::WhiteSimilarity to be very good myself, but it also implements a bunch of other tests.
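For a feel of what that measure does, here is a small pure-Ruby sketch of the letter-pair similarity that, as far as I understand, Text::WhiteSimilarity is based on (Simon White's "How to Strike a Match"). The method names are mine, and the gem's handling of edge cases may differ.

```ruby
# Adjacent letter pairs of every whitespace-separated word, upper-cased.
def letter_pairs(text)
  text.upcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Twice the number of shared pairs divided by the total number of pairs.
def white_similarity(a, b)
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?

  shared = 0
  pairs_a.each do |pair|
    idx = pairs_b.index(pair)
    next unless idx
    pairs_b.delete_at(idx)   # each pair in b can only be used once
    shared += 1
  end
  2.0 * shared / total
end

puts white_similarity('Healed', 'Sealed')          # 0.8
puts white_similarity('John Smith', 'Smith John')  # 1.0
```

Because pairs are taken per word, the score ignores word order, which is handy when one source stores names as "Last First" and another as "First Last".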
One initial attempt at a robust human name matcher / clustering solution in Ruby: https://github.com/adrianomitre/match_author_names