There are different methods to calculate distance between two vectors of the same length: Euclidean, Manhattan, Hamming ...
I'm wondering about any method that would calculate distance between vectors of different length.
The Euclidean distance formula finds the distance between any two points in Euclidean space.
A point in Euclidean space is also called a Euclidean vector.
You can use the Euclidean distance formula to calculate the distance between vectors of two different lengths.
For vectors of different dimension, the same principle applies.
Suppose a vector of lower dimension also exists in the higher dimensional space. You can then set all of the missing components in the lower dimensional vector to 0 so that both vectors have the same dimension. You would then use any of the mentioned distance formulas for computing the distance.
For example, consider a 2-dimensional vector A in R² with components (a1, a2), and a 3-dimensional vector B in R³ with components (b1, b2, b3).
To express A in R³, you would set its components to (a1, a2, 0). Then, the Euclidean distance d between A and B can be found using the formula:
d² = (b1 - a1)² + (b2 - a2)² + (b3 - 0)²
d = sqrt((b1 - a1)² + (b2 - a2)² + b3²)
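As a rough illustration of the formula above, here is a minimal Python sketch. The function name `zero_padded_euclidean` is made up for this example, and it assumes the missing components always sit at the end of the shorter vector.

```python
import math

def zero_padded_euclidean(a, b):
    """Euclidean distance after padding the shorter vector with trailing zeros."""
    n = max(len(a), len(b))
    # Assumption: every missing trailing component is treated as 0.
    a_padded = list(a) + [0] * (n - len(a))
    b_padded = list(b) + [0] * (n - len(b))
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a_padded, b_padded)))

# A = (1, 2) is treated as (1, 2, 0); B = (4, 6, 12)
d = zero_padded_euclidean([1, 2], [4, 6, 12])
print(d)  # 13.0, since 3² + 4² + 12² = 169
```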
For your particular case, the components will be either 0 or 1, so all differences will be -1, 0, or 1. The squared differences will then only be 0 or 1.
If you're using integers or individual bits to represent the components, you can use simple bitwise operations instead of some of the arithmetic (^ means XOR, or exclusive or; the parentheses matter because in many languages ^ binds less tightly than +):
d = sqrt((b1 ^ a1) + (b2 ^ a2) + ... + (b(n-1) ^ a(n-1)) + (b(n) ^ a(n)))
And since we're assuming the trailing components of A are 0, and x ^ 0 = x, the final formula will be:
d = sqrt((b1 ^ a1) + (b2 ^ a2) + ... + b(n-1) + b(n))
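The XOR version could look something like the sketch below. Both function names are hypothetical. The packed variant additionally assumes that component i is stored in bit i of an integer (least significant bit first), so that the missing trailing components of the shorter vector are simply zero high bits.

```python
import math

def binary_distance(a_bits, b_bits):
    """Euclidean distance between two 0/1 vectors, the shorter one zero-padded."""
    short, long_ = sorted((list(a_bits), list(b_bits)), key=len)
    # Overlapping positions: x ^ y is 1 exactly when the two bits differ.
    differing = sum(x ^ y for x, y in zip(short, long_))
    # Trailing positions: the padded value is 0, and b ^ 0 == b.
    differing += sum(long_[len(short):])
    return math.sqrt(differing)

def packed_binary_distance(a, b):
    """Same distance when each vector is packed into a single integer."""
    # Assumption: component i lives in bit i, so padding is implicit.
    return math.sqrt(bin(a ^ b).count("1"))

# A = (1, 0) padded to (1, 0, 0); B = (1, 1, 1): two positions differ.
print(binary_distance([1, 0], [1, 1, 1]))
print(packed_binary_distance(0b01, 0b111))
```

Both calls count two differing positions, so each returns the square root of 2. Without the square root, the same count is the Hamming distance of the padded vectors.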
You cannot directly compute distances between vectors of differing length.
All suggestions here start with a function that maps the lower-length vector to a higher-length one, then doing the calculation as normal.
There are many, many functions (infinitely many, in fact) that one can use:
Since the result of the distance calculation strongly depends on the function that converts the shorter vector to the longer one, everybody needs to be clear about which function is used, either because everybody in the field agrees that only one function makes sense, or because the function used in the conversion is noted down.
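To make the dependence on the mapping concrete, here is a small sketch in which the mapping is passed in explicitly. The names `distance_with_embedding`, `pad_zeros` and `pad_last` are invented for this example, and the two paddings are just two arbitrary choices out of the many possible ones.

```python
import math

def pad_zeros(v, n):
    """Extend v to length n with trailing zeros."""
    return list(v) + [0] * (n - len(v))

def pad_last(v, n):
    """Extend v to length n by repeating its last component."""
    return list(v) + [v[-1]] * (n - len(v))

def distance_with_embedding(short, long_, embed):
    """Euclidean distance after lifting `short` to len(long_) with `embed`."""
    lifted = embed(short, len(long_))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lifted, long_)))

short, long_ = [3, 4], [3, 4, 4]
print(distance_with_embedding(short, long_, pad_zeros))  # 4.0
print(distance_with_embedding(short, long_, pad_last))   # 0.0
```

The same pair of vectors is 4 apart under one mapping and identical under the other, which is why the mapping has to be stated alongside the distance.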
You can try to calculate the average minimum distance between two vectors p and q of dimensions n and m (n ≠ m), using the absolute difference between components so that the terms are non-negative:
d = 1/n * sum_i=1:n ( min_j=1:m |p(i) - q(j)| ) + 1/m * sum_j=1:m ( min_i=1:n |p(i) - q(j)| )
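A possible Python reading of this formula is sketched below. The function name is made up, and taking the absolute difference as the per-component distance is an assumption made here so that the result is non-negative and symmetric.

```python
def average_min_distance(p, q):
    """Sum of the two directed average nearest-component distances."""
    if not p or not q:
        raise ValueError("both vectors must be non-empty")
    # For each p(i), distance to the closest q(j), averaged over p.
    p_to_q = sum(min(abs(x - y) for y in q) for x in p) / len(p)
    # For each q(j), distance to the closest p(i), averaged over q.
    q_to_p = sum(min(abs(x - y) for x in p) for y in q) / len(q)
    return p_to_q + q_to_p

print(average_min_distance([1, 5], [2, 4, 9]))  # 3.0 = (1 + 1)/2 + (1 + 1 + 4)/3
```

Note that this measure ignores the order of the components: it compares the two vectors as sets of values, which may or may not be what you want.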
Padding the shorter array with zeros so that it matches the length of the longer one is not a generally valid approach.
For example, suppose we have two sets (arrays, vectors, ...) of measurements of the same parameter (e.g. temperature, speed, or a binary parameter such as the status of an on/off switch) taken at different time instants. Assume that the first set A1 consists of N measurements made at a set of instants T1, whereas the second set A2 consists of M measurements (M ≠ N) taken at a set of instants T2.
Note that the distribution of T2 may differ arbitrarily from that of T1, so padding with zeros doesn't make sense here.
In this case, I suggest interpolating both sets onto a common set of time instants, say T, as follows:
A1_new = interpolate (T1, A1, T);
A2_new = interpolate (T2, A2, T);
where interpolate(x, y, xq) takes the sample points x, the sampled values y(x) and the query points xq, and returns the interpolated values y(xq).
Now, we can compare the same-size sets A1_new and A2_new by any suitable measure e.g. Euclidean distance.
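As a sketch of this procedure, the Python below implements `interpolate` as plain piecewise-linear interpolation. That choice is an assumption (the answer does not say which interpolation to use), as are the sample data and the requirements that x is strictly increasing, has at least two points, and covers every query point.

```python
import math
from bisect import bisect_right

def interpolate(x, y, xq):
    """Piecewise-linear interpolation of the samples (x, y) at query points xq.

    Assumes x is strictly increasing and every query lies in [x[0], x[-1]].
    """
    out = []
    for t in xq:
        if not x[0] <= t <= x[-1]:
            raise ValueError(f"query point {t} is outside the sampled range")
        # Index of the segment [x[i], x[i+1]] that contains t.
        i = min(bisect_right(x, t) - 1, len(x) - 2)
        w = (t - x[i]) / (x[i + 1] - x[i])
        out.append(y[i] + w * (y[i + 1] - y[i]))
    return out

# Hypothetical measurements taken at different instants.
T1, A1 = [0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]
T2, A2 = [0.0, 1.5, 3.0], [1.0, 1.0, 1.0]
T = [0.0, 1.0, 2.0, 3.0]  # common instants inside both time ranges

A1_new = interpolate(T1, A1, T)
A2_new = interpolate(T2, A2, T)
d = math.sqrt(sum((u - v) ** 2 for u, v in zip(A1_new, A2_new)))
print(d)  # 6.0
```

For a binary parameter such as a switch state, linear interpolation produces fractional values, so a nearest-neighbour or previous-value rule may be a better fit there.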