Tanimoto coefficient distance measure

Posted 2019-06-27 08:56

Question:

Can two objects have identical cosine and Tanimoto coefficient distance measures, where

Tanimoto distance measure, d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|) - x.y)

and

cosine measure, d(x,y) = x.y / (|x| * |y|)?

Answer 1:

The Tanimoto similarity coefficient (which is not a true distance measure) is defined by

d(x,y) = x.y / ((|x|*|x|) + (|y|*|y|)- x.y)

for bit vectors x and y.

Now compare that with the cosine similarity coefficient,

 d(x,y) = x.y / (|x| * |y|)

Both coefficients have the same numerator, x.y, and only the denominators differ. So whenever x.y is zero (and neither vector is all zeros), the Tanimoto and cosine similarity coefficients are the same: both are zero.

Geometrically, x.y is zero if and only if x and y are perpendicular.

Since x and y are bit vectors (i.e. each component is either 0 or 1), x.y equalling zero means

x1*y1 + x2*y2 + ... + xn*yn = 0

Every term xi*yi is either 0 or 1, so none can be negative. If any term equalled 1*1 = 1, the whole sum would be positive. For the whole sum to be zero, every term must therefore equal 0:

x1*y1 = 0
x2*y2 = 0
...
xn*yn = 0

In other words, x and y never have a 1 in the same position: if xi is 1, then yi must be 0, and if yi is 1, then xi must be 0.

So there are tons of examples where the Tanimoto similarity is equal to the cosine similarity:

x = (0,1,0,1)
y = (1,0,0,0)

for instance.
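
To check this numerically, here is a minimal Python sketch. The helper names dot, tanimoto and cosine are my own, not from any library, and the Tanimoto denominator uses the squared norms as in the formula above. It evaluates both coefficients on the example pair, and on a second pair (u, v, chosen by me for illustration) that shares a 1 in one position.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (|x|^2 + |y|^2 - x.y).
    Undefined (division by zero) if both vectors are all zeros."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def cosine(x, y):
    """Cosine similarity: x.y / (|x| * |y|).
    Undefined (division by zero) if either vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

# The example from the answer: no position holds a 1 in both vectors.
x = (0, 1, 0, 1)
y = (1, 0, 0, 0)
print(tanimoto(x, y), cosine(x, y))   # both are 0.0

# A hypothetical overlapping pair: the two coefficients now differ.
u = (1, 1, 0, 0)
v = (1, 0, 1, 0)
print(tanimoto(u, v), cosine(u, v))   # roughly 0.333 and 0.5
```

The second pair shows that once x.y is positive the two denominators generally give different values.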



Answer 2:

Even though the general form of Tanimoto distance was presented, you must always remember that, computationally, there is a binary form and continuous form.

The binary form is:

d(x,y) = n(X ∩ Y) / [ n(X) + n(Y) - n(X ∩ Y) ]

while the continuous form is:

d(x,y) = X.Y / (||X||^2 + ||Y||^2 - X.Y)

The difference matters computationally. If a coder is working for you, instruct them that n(X ∩ Y), n(X) and n(Y) only involve counting the number of ones in the vectors. For the continuous form, ||X|| is the length of the vector X from the origin (also called the norm), i.e. the square root of (X1^2 + X2^2 + ... + Xp^2). Because the formula uses the squared norm, ||X||^2 is simply the sum of squares X1^2 + X2^2 + ... + Xp^2, so the square root never has to be evaluated. Taking square roots would be unnecessary in either form, and wasteful for big data mining, since square roots are comparatively expensive. The continuous variant does still need the multiplications, whereas the binary variant needs only counts.
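
As a rough illustration of the two variants, here is a minimal Python sketch. The function names tanimoto_binary and tanimoto_continuous are hypothetical, and the continuous version assumes the denominator uses squared norms (sums of squares), as in the formula given in Answer 1, so no square root is taken.

```python
def tanimoto_binary(x, y):
    """Binary form: n(X ∩ Y) / (n(X) + n(Y) - n(X ∩ Y)).
    Only counts ones; no multiplications or square roots.
    Undefined (division by zero) if both vectors are all zeros."""
    both = sum(1 for a, b in zip(x, y) if a and b)  # n(X ∩ Y)
    nx = sum(1 for a in x if a)                     # n(X)
    ny = sum(1 for b in y if b)                     # n(Y)
    return both / (nx + ny - both)

def tanimoto_continuous(x, y):
    """Continuous form, assuming squared norms in the denominator:
    X.Y / (sum(Xi^2) + sum(Yi^2) - X.Y)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)   # squared norm of X
    yy = sum(b * b for b in y)   # squared norm of Y
    return xy / (xx + yy - xy)

# On bit vectors the two forms agree, because 1*1 = 1 and 0*0 = 0.
bx = (1, 1, 0, 1)
by = (1, 0, 1, 1)
print(tanimoto_binary(bx, by), tanimoto_continuous(bx, by))  # 0.5 0.5

# Real-valued vectors need the continuous form.
rx = (1.0, 2.0, 0.0)
ry = (2.0, 1.0, 1.0)
print(tanimoto_continuous(rx, ry))  # 4/7
```

On 0/1 data the binary version gives the same result with counting alone, which is the saving the answer is pointing at.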

In summary, always remember that for Tanimoto distance, there are two types: binary and continuous.