Given dataset 1

```
name,x,y
st. peter,1,2
big university portland,3,4
```

and dataset 2

```
name,x,y
saint peter,3,4
uni portland,5,6
```
The goal is to merge them on `name`:

```python
d1.merge(d2, on="name", how="left")
```
There are no exact matches on `name`, though, so I'm looking to do a kind of fuzzy matching. The specific technique doesn't matter much here; the question is more about how to incorporate it efficiently into pandas.
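For reference, here is the setup as runnable code, building the two frames by hand from the data above:

```python
import pandas as pd

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

# The plain merge finds nothing: every right-hand column comes back as NaN
# because no name matches exactly.
d1.merge(d2, on="name", how="left")
```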
For example, `st. peter` might match `saint peter` in the other dataset, but `big university portland` might be too much of a deviation to be matched with `uni portland`.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (`st.` --> `saint` is 4).
The resulting dataframe should only contain the `st. peter` row, and contain both `name` variations as well as both the `x` and `y` variables.
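So, roughly speaking, the desired result would look something like this (the exact column names are not important):

```
      name_1       name_2  x_1  y_1  x_2  y_2
0  st. peter  saint peter    1    2    3    4
```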
Is there a way to do this kind of merging using pandas?
Let's say you have that function which returns the best match if any, None otherwise:
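For example, a rough sketch of such a function, using the python-Levenshtein package and a threshold of 4 edits (both of those choices are arbitrary; any edit-distance function would work the same way):

```python
import Levenshtein  # pip install python-Levenshtein (any edit-distance function would do)

def best_match(name, candidates, max_dist=4):
    """Return the candidate closest to `name` by Levenshtein distance,
    or None if even the closest candidate is more than `max_dist` edits away."""
    distances = {cand: Levenshtein.distance(name, cand) for cand in candidates}
    if not distances:
        return None
    best = min(distances, key=distances.get)
    return best if distances[best] <= max_dist else None
```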
Then you can join on the values it returns. There are different ways to do that, which would lead to different output (I think; I did not look closely at this aspect):
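For instance, one way (the helper column `match` and the suffixes are arbitrary names I picked) is to attach each name's best match from `d2` as a new column on `d1` and merge on that:

```python
import pandas as pd

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

# Best fuzzy match from d2 for every name in d1 (None where nothing is close enough);
# best_match() is the sketch above.
d1["match"] = d1["name"].apply(lambda n: best_match(n, d2["name"]))

merged = (
    d1.merge(d2, left_on="match", right_on="name", how="left", suffixes=("_1", "_2"))
      .dropna(subset=["match"])   # keep only rows that actually found a match
      .drop(columns="match")      # the helper column is no longer needed
)
```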
Did you look at fuzzywuzzy?
You might do something like:
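For instance, something along these lines with `process.extractOne` (the score cutoff of 80 and the helper column name are arbitrary choices of mine, and note that fuzzywuzzy scores similarity from 0 to 100 rather than counting edits):

```python
import pandas as pd
from fuzzywuzzy import process  # pip install fuzzywuzzy

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

def fuzzy_key(name, choices, cutoff=80):
    # extractOne returns the best (match, score, ...) tuple, or None below the cutoff.
    result = process.extractOne(name, choices, score_cutoff=cutoff)
    return result[0] if result else None

d1["match"] = d1["name"].apply(lambda n: fuzzy_key(n, d2["name"].tolist()))
merged = d1.merge(d2, left_on="match", right_on="name",
                  how="inner", suffixes=("_1", "_2"))
```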
Caveat Emptor: I haven't tried to run this.
The simplest idea I can come up with right now is to create a special dataframe with the distances between all names:
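A sketch of that idea, using the python-Levenshtein package for the distances and the question's threshold of fewer than 5 edits (renaming the columns up front just avoids suffix clashes in the merges):

```python
import pandas as pd
import Levenshtein  # pip install python-Levenshtein

d1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                   "x": [1, 3], "y": [2, 4]})
d2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                   "x": [3, 5], "y": [4, 6]})

# Distance between every name in d1 and every name in d2 (a full cross product).
distances = pd.DataFrame(
    [(a, b, Levenshtein.distance(a, b)) for a in d1["name"] for b in d2["name"]],
    columns=["name_1", "name_2", "distance"],
)

# Keep only the pairs within the threshold, then attach both sets of columns.
pairs = distances[distances["distance"] < 5]
result = (
    pairs
    .merge(d1.rename(columns={"name": "name_1", "x": "x_1", "y": "y_1"}), on="name_1")
    .merge(d2.rename(columns={"name": "name_2", "x": "x_2", "y": "y_2"}), on="name_2")
)
```

Note that this computes len(d1) * len(d2) distances, so it is only practical for smallish frames.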