I am learning fuzzywuzzy in Python and understand the concepts of fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio. My question is: when should I use which function? Should I check the two strings' lengths first and, if they are not similar, rule out fuzz.partial_ratio? Or, if the two strings' lengths are similar, should I use fuzz.token_sort_ratio? Or should I always use fuzz.token_set_ratio?
Does anyone know what criteria SeatGeek uses?
I am trying to build a real estate website and am thinking of using fuzzywuzzy to compare addresses.
Any insight is much appreciated.
R.
Great question.
I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.
Overview
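Under the hood, each of the four methods calculates the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio function (SequenceMatcher.ratio in Python's difflib module), which returns a similarity measure between 0 and 1, computed as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. It is 1 if the sequences are identical and 0 if they have nothing in common.
The four fuzzywuzzy methods call difflib.ratio on different combinations of the input strings.
fuzz.ratio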
Simple. Just calls difflib.ratio on the two input strings (code).
fuzz.partial_ratio
Attempts to account for partial string matches better. Calls ratio using the shortest string (length n) against all n-length substrings of the larger string and returns the highest score (code).
For example, given "YANKEES" and "NEW YORK YANKEES", "YANKEES" is the shortest string (length 7), so we run the ratio with "YANKEES" against all substrings of length 7 of "NEW YORK YANKEES" (which includes checking against "YANKEES", a 100% match).
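A minimal sketch of the two functions described so far, built directly on Python's standard difflib rather than on fuzzywuzzy itself. The helper names, the rounding to a 0-100 integer, and the brute-force window loop are assumptions that follow the description above; the library's actual implementation may differ.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2*M/T, where M is the number of
    matched characters and T is the combined length of both strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best ratio of the shorter string against every same-length
    window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    n = len(shorter)
    # Slide an n-character window across the longer string.
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(ratio(shorter, window) for window in windows)


print(ratio("YANKEES", "NEW YORK YANKEES"))          # well below 100
print(partial_ratio("YANKEES", "NEW YORK YANKEES"))  # 100
```

The plain ratio is dragged down by the extra "NEW YORK " prefix, while the windowed version finds the exact 7-character match.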
fuzz.token_sort_ratio
Attempts to account for similar strings out of order. Calls ratio on both strings after sorting the tokens in each string (code). When two strings contain the same tokens in a different order, fuzz.ratio and fuzz.partial_ratio both fail, but once you sort the tokens it's a 100% match.
fuzz.token_set_ratio
Attempts to rule out differences in the strings. Calls ratio on three pairings built from the token intersection and each string's remaining tokens, and returns the max (code).
By splitting up the intersection and remainders of the two strings, we're accounting for both how similar and how different the two strings are.
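The two token-based methods can be sketched with the standard library as follows. This is an illustration under stated assumptions, not fuzzywuzzy's code: tokenizing is reduced to lowercasing and splitting on whitespace, the function names simply mirror the ones discussed, and the example strings are made up for the demonstration.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """difflib similarity scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def tokens(s: str) -> list:
    # Simplified preprocessing (assumption): lowercase, split on whitespace.
    return s.lower().split()


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the strings after sorting each one's tokens."""
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def token_set_ratio(a: str, b: str) -> int:
    """Compare the shared tokens against each string's leftovers."""
    ta, tb = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(ta & tb))
    # Intersection followed by the tokens unique to each string.
    with_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, with_a),
               ratio(common, with_b),
               ratio(with_a, with_b))


s1 = "New York Mets vs Atlanta Braves"
s2 = "Atlanta Braves vs New York Mets"
print(ratio(s1, s2))             # penalized by the different word order
print(token_sort_ratio(s1, s2))  # 100

t1 = "mariners vs angels"
t2 = "los angeles angels of anaheim at seattle mariners"
print(token_sort_ratio(t1, t2), token_set_ratio(t1, t2))
```

In the second pair, the shared tokens dominate the token-set score even though the longer string carries many extra words.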
Application
This is where the magic happens. At SeatGeek, essentially we create a vector score with each ratio for each data point (venue, event name, etc.) and use that to inform programmatic decisions of similarity that are specific to our problem domain.
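As a rough illustration of the "vector score" idea, the sketch below scores every field of two records with every available scorer. The record fields, the scorer set, and the function name score_vector are all hypothetical; the answer does not describe SeatGeek's actual data model or decision rules, and the scorers here are difflib stand-ins rather than fuzzywuzzy calls.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """difflib similarity scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """ratio after sorting each string's lowercased tokens."""
    return ratio(" ".join(sorted(a.lower().split())),
                 " ".join(sorted(b.lower().split())))


# Hypothetical scorer set; a real one would include more ratio types.
SCORERS = {"ratio": ratio, "token_sort_ratio": token_sort_ratio}


def score_vector(rec_a: dict, rec_b: dict) -> dict:
    """One score per (field, scorer) pair for fields both records share."""
    return {
        f"{field}:{name}": scorer(rec_a[field], rec_b[field])
        for field in sorted(rec_a.keys() & rec_b.keys())
        for name, scorer in SCORERS.items()
    }


a = {"venue": "Yankee Stadium", "event": "New York Mets vs Atlanta Braves"}
b = {"venue": "Yankee Stadium", "event": "Atlanta Braves vs New York Mets"}
print(score_vector(a, b))
```

A domain-specific rule can then weigh the entries differently, for instance trusting the token-sorted event score more than the raw one.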
That being said, truth be told, it doesn't sound like FuzzyWuzzy is useful for your use case. It will be tremendously bad at determining whether two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12".
FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.
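The effect is easy to reproduce with the standard library. This sketch uses difflib as a stand-in for fuzz.ratio, so the exact number fuzzywuzzy reports may differ slightly.

```python
from difflib import SequenceMatcher

a = "235 Park Ave Floor 12"
b = "235 Park Ave S. Floor 12"

# Character-level similarity scaled to 0-100.
score = round(100 * SequenceMatcher(None, a, b).ratio())
print(score)  # high, even though the addresses are different places
```

Only the three characters "S. " separate the two strings, so any character-based score stays high while the real-world locations differ.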
For your problem, you would be better off using the Google Geocoding API.
As of June 2017, fuzzywuzzy also includes some other comparison functions. Here is an overview of the ones missing from the accepted answer (taken from the source code):
fuzz.partial_token_sort_ratio
Same algorithm as in token_sort_ratio, but instead of applying ratio after sorting the tokens, uses partial_ratio.
fuzz.partial_token_set_ratio
Same algorithm as in token_set_ratio, but instead of applying ratio to the sets of tokens, uses partial_ratio.
fuzz.QRatio, fuzz.UQRatio
Just wrappers around fuzz.ratio with some validation and short-circuiting, included here for completeness. UQRatio is a unicode version of QRatio.
fuzz.WRatio
An attempt to weight (the name stands for 'Weighted Ratio') results from different algorithms to calculate the 'best' score. The docstring in the source code describes the individual steps.
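The general shape of such a weighted combination might look like the sketch below. Treat every constant as an assumption: the length thresholds (1.5 and 8) and scale factors (0.9, 0.6, 0.95) are recalled from the library's source rather than quoted from this answer, so check them against the installed version. The function name weighted_ratio and the scorers dictionary are hypothetical, and the library's string preprocessing is omitted.

```python
def weighted_ratio(s1: str, s2: str, scorers: dict) -> int:
    """Pick the best of several scores, discounting the looser ones.

    scorers maps names such as "ratio" or "partial_ratio" to
    callables taking two strings and returning a 0-100 score.
    """
    if not s1 or not s2:
        return 0  # short-circuit on empty input

    base = scorers["ratio"](s1, s2)
    len_ratio = max(len(s1), len(s2)) / min(len(s1), len(s2))

    if len_ratio < 1.5:
        # Similar lengths: use the full-string token methods only.
        candidates = [
            base,
            0.95 * scorers["token_sort_ratio"](s1, s2),
            0.95 * scorers["token_set_ratio"](s1, s2),
        ]
    else:
        # Very different lengths: use partial methods, scaled down so
        # that only a full match can reach 100.
        partial_scale = 0.6 if len_ratio > 8 else 0.9
        candidates = [
            base,
            partial_scale * scorers["partial_ratio"](s1, s2),
            0.95 * partial_scale * scorers["partial_token_sort_ratio"](s1, s2),
            0.95 * partial_scale * scorers["partial_token_set_ratio"](s1, s2),
        ]
    return round(max(candidates))


def fixed(score):
    """Stand-in scorer that always returns the same score."""
    return lambda a, b: score


demo = {name: fixed(50) for name in (
    "ratio", "token_sort_ratio", "token_set_ratio",
    "partial_token_sort_ratio", "partial_token_set_ratio")}
demo["partial_ratio"] = fixed(100)
print(weighted_ratio("abc", "abcdef", demo))  # 90
```

The demo uses fixed stand-in scorers only to show the weighting: a perfect partial match on strings of very different length is capped at 90.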
fuzz.UWRatio
Unicode version of WRatio.