Is there a general way to convert between a measure of similarity and a measure of distance?
Consider a similarity measure like the number of 2-grams that two strings have in common.
2-grams('beta', 'delta') = 1
2-grams('apple', 'dappled') = 4
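For concreteness, the two counts above can be reproduced with a short sketch. It assumes that a 2-gram is a pair of adjacent characters and that each distinct 2-gram is counted once; the helper names `bigrams` and `shared_bigrams` are made up for illustration.

```python
def bigrams(text):
    """Return the set of distinct adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def shared_bigrams(a, b):
    """Count the distinct 2-grams that occur in both strings."""
    return len(bigrams(a) & bigrams(b))


print(shared_bigrams("beta", "delta"))     # 1  (only 'ta' is shared)
print(shared_bigrams("apple", "dappled"))  # 4  ('ap', 'pp', 'pl', 'le')
```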
What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance?
This is just an example... I'm looking for a general solution, if one exists. For instance, how would I go from Levenshtein distance to a measure of similarity?
I appreciate any guidance you may offer.
similarity = 1/difference, and watch out for difference = 0
Doing 1/similarity is not going to keep the properties of the distribution.
The best way is distance(a, b) = highest similarity - similarity(a, b), where highest similarity is the largest similarity value you have. You thereby flip your distribution: the highest similarity becomes a distance of 0, and so on.
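A minimal sketch of this flip, assuming the "highest similarity" is the largest value observed in the data set at hand (a known theoretical maximum could be used instead). The function name and the dictionary layout are hypothetical.

```python
def similarities_to_distances(similarities):
    """Map every similarity s to (highest similarity - s)."""
    highest = max(similarities.values())
    return {pair: highest - s for pair, s in similarities.items()}


# Shared 2-gram counts from the question, keyed by string pair.
sims = {("beta", "delta"): 1, ("apple", "dappled"): 4}
dists = similarities_to_distances(sims)
print(dists[("apple", "dappled")])  # 0  (the most similar pair)
print(dists[("beta", "delta")])     # 3
```

Unlike 1/similarity, this keeps the spacing between values intact, so two pairs whose similarities differ by 3 also have distances that differ by 3.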
In the case of Levenshtein distance, you could increase the sim score by 1 for every time the sequences match; that is, 1 for every time you didn't need a deletion, insertion or substitution. That way the metric would be a linear measure of how many characters the two strings have in common.
In one of my projects (based on collaborative filtering) I had to convert a correlation (the cosine between vectors), which ranges from -1 to 1 (closer to 1 is more similar, closer to -1 is more diverse), to a normalized distance (closer to 0 means a smaller distance, closer to 1 a bigger one).
In this case: distance ~ diversity
My formula was:
dist = 1 - (cor + 1)/2
If you are converting similarity to diversity and the domain is [0, 1] in both cases, the simplest way is:
dist = 1 - sim
sim = 1 - dist
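As a rough sketch, the three formulas can be written as small functions. The function names are made up, and the range check is an added assumption rather than part of the original formulas.

```python
def correlation_to_distance(cor):
    """Map a correlation in [-1, 1] to a distance in [0, 1]."""
    if not -1.0 <= cor <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return 1.0 - (cor + 1.0) / 2.0


def similarity_to_distance(sim):
    """Both measures live in [0, 1]; 1 means identical."""
    return 1.0 - sim


def distance_to_similarity(dist):
    """Inverse of similarity_to_distance."""
    return 1.0 - dist


print(correlation_to_distance(1.0))   # 0.0  (perfectly correlated)
print(correlation_to_distance(-1.0))  # 1.0  (perfectly anti-correlated)
print(correlation_to_distance(0.0))   # 0.5
```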
Let d denote distance and s denote similarity. To convert a distance measure to a similarity measure, we first need to normalize d to [0, 1], using d_norm = d/max(d). Then the similarity measure is given by:
s = 1 - d_norm.
where s is in the range [0, 1], with 1 denoting the highest similarity (the items being compared are identical) and 0 denoting the lowest similarity (the largest distance).
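A small sketch of this normalization, assuming max(d) is taken over the distances you actually have and that all distances are nonnegative. The function name is hypothetical, and returning 1 for every item when all distances are 0 is an added assumption to avoid dividing by zero.

```python
def distances_to_similarities(distances):
    """Normalize by the largest distance, then take s = 1 - d_norm."""
    largest = max(distances)
    if largest == 0:
        # Assumption: every pair is identical, so similarity is 1.
        return [1.0 for _ in distances]
    return [1.0 - d / largest for d in distances]


# Example with made-up distance values (e.g. Levenshtein distances).
print(distances_to_similarities([0, 2, 4]))  # [1.0, 0.5, 0.0]
```

Note that the result depends on the data set: adding a pair with a larger distance changes max(d) and therefore every similarity.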
Cosine similarity is widely used for n-gram count or TFIDF vectors.
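Cosine similarity can be used to compute a formal distance metric, the angular distance, according to Wikipedia. It obeys all the properties of a distance that you would expect (symmetry, nonnegativity, etc.):
angular_distance = 2 * arccos(cosine_similarity) / pi
angular_similarity = 1 - angular_distance
For vectors with nonnegative components, such as n-gram counts, both of these metrics range between 0 and 1.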
If you have a tokenizer that produces N-grams from a string you could use these metrics like this:
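A minimal sketch, not the original snippet: it assumes a hypothetical character n-gram tokenizer `ngrams` that returns a `collections.Counter`, and it takes the angular distance to be 2 * arccos(cosine similarity) / pi, the form given for nonnegative vectors in Wikipedia's cosine similarity article. Treating an empty string as having similarity 0 is an added assumption.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Hypothetical tokenizer: character n-grams with their counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dot(a, b):
    """Inner product of two Counters over their shared keys."""
    return sum(a[key] * b[key] for key in a.keys() & b.keys())


def cosine_similarity(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    if norm == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    # Clamp to 1.0 so rounding error cannot push acos out of its domain.
    return min(1.0, dot(a, b) / norm)


def angular_distance(a, b):
    """Ranges over [0, 1] for nonnegative count vectors."""
    return 2.0 * math.acos(cosine_similarity(a, b)) / math.pi


def angular_similarity(a, b):
    return 1.0 - angular_distance(a, b)


beta, delta = ngrams("beta"), ngrams("delta")
print(cosine_similarity(beta, delta))  # 1 / sqrt(3 * 4), about 0.289
print(angular_distance(beta, delta))
print(angular_similarity(beta, delta))
```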
I found the elegant inner product of Counter in this SO answer.