Maybe I'm rather stupid but I just can't find a satisfying answer:
Using the KNN-algorithm, say k=5. Now I try to classify an unknown object by getting its 5 nearest neighbours. What to do, if after determining the 4 nearest neighbors, the next 2 (or more) nearest objects have the same distance? Which object of these 2 or more should be chosen as the 5th nearest neighbor?
Thanks in advance :)
Which object of these 2 or more should be chosen as the 5th nearest neighbor?
It really depends on how you want to implement it.
Most algorithms will do one of three things:
- Include all equal distance points, so for this estimation, they'll use 6 points, not 5.
- Use the "first" found point of the two equal distant.
- Pick a random (usually with a consistent seed, so results are reproducable) point from the 2 points found.
That being said, most algorithms based on radial searching have an inherent assumption of stationarity, in which case, it really shouldn't matter which of the options above you choose. In general, any of them should, theoretically, provide reasonable defaults (especially since they're the furthest points in the approximation, and should have the lowest effective weightings).
Another and interesting option is to use the nearest neighbor like this:
You calculate the distances of the 5 nearest neighbors from each class to the sample: you will have 5 distances from each class.
Then you get the mean distance for each class.
That lower mean distance will be the class you will assign to the sample.
This way is effective for datasets of classes that overlap.
If you have another distance function, you can use it to break the tie. Even a bad one can do the job, better if you have some heuristics. For instance, if you know that one of the feature considered to compute your main distance is more significant, use only this one to solve the tie.
If it's not the case, pick at random. The run several times your program on the same test set, to check if the random choice matters.
Maybe you can try fuzzy knn. For the choice of k I think lots of experiments should be done in order to get the best classification result.
If you have k=5, you look at the top five records, look at the most common result out of those five. It's probable that you would get two pairs which would put you in a bind and it would be tough, because then you have a 50/50 chance of each pair.
So that makes life challenging. So how do you pick out a value for k? There are some metrics you can use to analyze the result after the fact, but no strict rule of what k must be, so I would make it easy on yourself just starting out and stick with k=3 instead of k=5 and then down the road look into some strategies that can assist you in optimizing the value of k, by looking at the actual accuracy of your predictions.