
Question:

There are different methods to calculate distance between two vectors of the same length: Euclidean, Manhattan, Hamming ...

I'm wondering about any method that would calculate distance between vectors of different length.

Answer 1:

The Euclidean distance formula finds the distance between any two points in Euclidean space.

A point in Euclidean space is also called a Euclidean vector.

You can use the Euclidean distance formula to calculate the distance between two vectors of different magnitudes, as long as they have the same number of components.

For vectors of different dimension, the same principle applies.

Suppose a vector of lower dimension also exists in the higher dimensional space. You can then set all of the missing components in the lower dimensional vector to 0 so that both vectors have the same dimension. You would then use any of the mentioned distance formulas for computing the distance.

For example, consider a 2-dimensional vector A in R² with components (a1,a2), and a 3-dimensional vector B in R³ with components (b1,b2,b3).

To express A in R³, you would set its components to (a1,a2,0). Then, the Euclidean distance d between A and B can be found using the formula:

d² = (b1 - a1)² + (b2 - a2)² + (b3 - 0)²

d = sqrt((b1 - a1)² + (b2 - a2)² + b3²)
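The zero-padding approach above could be sketched in Python roughly as follows. The function name `padded_euclidean` and the sample values are made up for illustration; this is just one way to write it, assuming the missing trailing components really should be treated as 0.

```python
import math

def padded_euclidean(a, b):
    """Euclidean distance after zero-padding the shorter vector (assumption: missing components are 0)."""
    n = max(len(a), len(b))
    # Append zeros so both vectors have n components.
    a_pad = list(a) + [0.0] * (n - len(a))
    b_pad = list(b) + [0.0] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a_pad, b_pad)))

# A = (1, 2) is treated as (1, 2, 0); the differences to B are (3, 4, 12).
d = padded_euclidean([1.0, 2.0], [4.0, 6.0, 12.0])
print(d)
```

The argument order does not matter here, since whichever vector is shorter gets padded.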

For your particular case, the components will be either 0 or 1, so all differences will be -1, 0, or 1. The squared differences will then only be 0 or 1.

If you're using integers or individual bits to represent the components, you can use simple bitwise operations instead of some arithmetic (^ means XOR or exclusive or):

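d = sqrt((b1 ^ a1) + (b2 ^ a2) + ... + (b(n-1) ^ a(n-1)) + (b(n) ^ a(n)))

(The parentheses matter in C-like languages, where + binds more tightly than ^.)

And since we're assuming the trailing components of A are 0, and x ^ 0 = x, the final formula will be:

d = sqrt((b1 ^ a1) + (b2 ^ a2) + ... + b(n-1) + b(n))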
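For 0/1 components, a rough Python sketch of the bitwise idea might look like this. It packs each vector into an integer (component i goes to bit i), so the shorter vector's missing components are implicitly 0. The helper names `bits_to_int` and `binary_distance` are hypothetical, not from any library.

```python
import math

def bits_to_int(bits):
    """Pack a sequence of 0/1 components into an int; component i is stored in bit i."""
    value = 0
    for i, bit in enumerate(bits):
        value |= (bit & 1) << i
    return value

def binary_distance(a_bits, b_bits):
    """Euclidean distance between 0/1 vectors; missing components count as 0."""
    diff = bits_to_int(a_bits) ^ bits_to_int(b_bits)  # a 1 bit wherever the vectors differ
    differing = bin(diff).count("1")                  # number of differing components
    return math.sqrt(differing)

# (1, 0) vs (0, 1, 1, 0, 1): components 1, 2, 3 and 5 differ.
print(binary_distance([1, 0], [0, 1, 1, 0, 1]))
```

The count of differing components is the Hamming distance, so for binary vectors the Euclidean distance is simply its square root.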

Answer 2:

You cannot directly compute distances between vectors of differing length.

All suggestions here start with a function that maps the shorter vector to a longer one, and then do the calculation as normal.

There are many, many functions (infinitely many, in fact) that one can use:

Fill up with zeroes. It's the easiest thing to do. Say, if you have a car and need to compute its distance to an airplane, this places the car at sea level.
Look up the missing values somewhere. With the car-airplane example, you'd fire up your geo database and look up heights from longitude/latitude.
Use some mathematical function.

Since the result of the distance calculation strongly depends on the function that converts the shorter vector to the longer one, everybody needs to be clear about which function is used: either everybody in the field agrees that only one function makes sense, or the function used in the conversion is written down.
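One way to make the choice of conversion function explicit is to pass it in as a parameter. The sketch below is only an illustration of that idea; `extend_with_zeros`, `extend_with_constant`, and the car/airplane numbers are all invented for the example.

```python
import math

def extend_with_zeros(short, target_len):
    """Fill the missing components with 0 (the car sits at sea level)."""
    return list(short) + [0.0] * (target_len - len(short))

def extend_with_constant(value):
    """Build an extension function that fills missing components with a looked-up value."""
    def extend(short, target_len):
        return list(short) + [value] * (target_len - len(short))
    return extend

def distance(a, b, extend=extend_with_zeros):
    """Euclidean distance after extending the shorter vector with the given function."""
    short, long_ = sorted((a, b), key=len)
    short = extend(short, len(long_))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(short, long_)))

car = [3.0, 4.0]           # hypothetical (x, y) position
plane = [0.0, 0.0, 12.0]   # hypothetical (x, y, altitude) position

print(distance(car, plane))                              # car placed at altitude 0
print(distance(car, plane, extend_with_constant(12.0)))  # car placed at altitude 12
```

The two calls give different results for the same inputs, which is exactly why the extension function has to be stated alongside the distance.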

Answer 3:

You can try to calculate the average minimum distance between two vectors p and q of dimensions n and m (n ≠ m), using the absolute difference between elements:

d = 1/n * sum_{i=1..n} min_{j=1..m} |p(i) - q(j)| + 1/m * sum_{j=1..m} min_{i=1..n} |p(i) - q(j)|
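A minimal Python sketch of this average minimum distance is below. It assumes the per-element distance is the absolute difference |p(i) - q(j)|, and the function name `average_min_distance` is just a placeholder.

```python
def average_min_distance(p, q):
    """Sum of the two directed average nearest-element distances between p and q."""
    # For each element of p, distance to its nearest element of q, averaged over p.
    p_to_q = sum(min(abs(x - y) for y in q) for x in p) / len(p)
    # For each element of q, distance to its nearest element of p, averaged over q.
    q_to_p = sum(min(abs(x - y) for x in p) for y in q) / len(q)
    return p_to_q + q_to_p

p = [0.0, 10.0]
q = [1.0, 2.0, 9.0]
print(average_min_distance(p, q))
```

Note that this treats each vector as an unordered collection of values, so reordering the elements of p or q does not change the result.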

Answer 4:

Padding the shorter array with zeros so that it has the same length as the longer one is not a correct idea in general.

For example, suppose we have two sets (arrays, vectors, ...) of measurements of the same parameter (e.g. temperature, speed, or a binary parameter such as the status of an on/off switch) taken at different time instants. Assume that the first set A1 consists of N measurements taken at a set of instants T1, whereas the second set A2 consists of M measurements (M ≠ N) taken at a set of instants T2.

Note that the distribution of T2 may differ arbitrarily from that of T1, so padding with zeros doesn't make sense here.

In this case, I suggest interpolating both sets onto a common set of time instants, say T, as follows:

A1_new = interpolate (T1, A1, T);

A2_new = interpolate (T2, A2, T);

where interpolate(x, y, xq) takes the sample points x, the sampled values y(x), and the query points xq, and returns the interpolated values y(xq).

Now we can compare the same-size sets A1_new and A2_new using any suitable measure, e.g. Euclidean distance.
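As a rough illustration, here is what this could look like in plain Python. The `interpolate` below is a hand-written piecewise-linear version, which is an assumption on my part (the answer does not say which interpolation method to use), and the sample values are invented.

```python
import math
from bisect import bisect_right

def interpolate(x, y, xq):
    """Piecewise-linear interpolation of samples (x, y) at the query points xq.

    Assumes x is strictly increasing; queries outside [x[0], x[-1]] are
    clamped to the first or last sample value.
    """
    out = []
    for t in xq:
        if t <= x[0]:
            out.append(y[0])
        elif t >= x[-1]:
            out.append(y[-1])
        else:
            i = bisect_right(x, t)                  # x[i-1] <= t < x[i]
            w = (t - x[i - 1]) / (x[i] - x[i - 1])  # position within the interval, 0..1
            out.append(y[i - 1] + w * (y[i] - y[i - 1]))
    return out

# Made-up measurements taken at different instants.
T1, A1 = [0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]
T2, A2 = [0.0, 1.5, 3.0], [1.0, 4.0, 7.0]
T = [0.0, 1.0, 2.0, 3.0]  # common set of time instants

A1_new = interpolate(T1, A1, T)
A2_new = interpolate(T2, A2, T)
d = math.sqrt(sum((u - v) ** 2 for u, v in zip(A1_new, A2_new)))
print(A1_new, A2_new, d)
```

If NumPy is available, `numpy.interp(T, T1, A1)` does the same kind of linear interpolation, though note its argument order differs from the `interpolate(x, y, xq)` convention used here.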