Is there a rule of thumb for how many samples the label that represents "everything else" should have in a multi-class classification task?
Example: I want to classify my input as one of X classes. Class X + 1 activates when the input is "none of the above." Suppose my dataset contains 5,000 samples from each of the 10 "positive" classes. For samples representing the "unknown" class, I'd use realistic examples likely to be found in production that do not belong to any of the other classes.
How large should this set of negative examples be relative to the other classes?
This may be a bit off-topic, but in any case, I don't think there is a general rule of thumb; it depends on your problem and your approach.
I would consider the following factors:
- The nature of the data. This is a bit abstract, but you can ask yourself whether you would expect samples from the "everything else" class to be easily confused with an actual class. For example, if you want to detect dogs or cats in general images of animals, there are probably many other animals (e.g. foxes) that may confuse the system; but if your input only has images of dogs, cats or furniture, maybe not so much. This is only an intuition, however, and in other problems it may not be so clear.
- Your model. For example, in this answer I gave to a related question I mention an approach that models the "everything else" class as a function of the rest of the classes, so you could argue that, if inputs are not too similar (previous point), it might just work even with no examples of "everything else", since none of the other classes would be triggered. Other tricks, like giving different training "weights" to each class (e.g. computed as a function of the number of instances you have of each one), may compensate for an imbalanced dataset (see the sketch after this list).
- Your goals. Obviously you want your system to be perfect, but you may consider whether you'd rather have false positives or false negatives (e.g. is it worse to miss an image of a dog or to say there's a dog when there's none). If you expect your input to be mostly composed of instances of "everything else", it may make sense for your model to be biased towards that class, or maybe for that very reason you want to be sure you don't discard any potentially interesting sample.
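To make the class-weighting trick concrete, here is a minimal sketch assuming scikit-learn; the class names, the 1,000-sample count for the "unknown" class, and the variable names are made up for illustration, not taken from your setup:

```python
# Minimal sketch: per-class training weights inversely proportional to class
# frequency, to compensate for an imbalanced "everything else" class.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 10 "positive" classes with 5,000 samples each,
# plus an "unknown" class with only 1,000 negative examples.
y_train = np.array([f"class_{i}" for i in range(10) for _ in range(5000)]
                   + ["unknown"] * 1000)

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)  # "unknown" gets a weight roughly 5x larger than the others
```

Many classifiers accept such a dictionary directly (e.g. a `class_weight` constructor argument) or equivalent per-sample weights through `sample_weight` in `.fit()`, so the under-represented class counts more during training.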
Unfortunately, the only reliable way to tell whether you are doing well is to experiment and compute good metrics over a representative test dataset (confusion matrix, per-class precision/recall, etc.), as in the sketch below.
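For instance, a minimal sketch of such an evaluation with scikit-learn, where `y_true` and `y_pred` are toy placeholders for your actual test labels and model predictions:

```python
# Minimal sketch: confusion matrix and per-class precision/recall on a test set.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["dog", "cat", "unknown", "dog", "unknown", "cat"]
y_pred = ["dog", "cat", "dog",     "dog", "unknown", "unknown"]

labels = ["dog", "cat", "unknown"]

# Rows = true classes, columns = predicted classes.
print(confusion_matrix(y_true, y_pred, labels=labels))

# Per-class precision/recall/F1, which shows how often "unknown"
# leaks into the real classes and vice versa.
print(classification_report(y_true, y_pred, labels=labels))
```

Looking at the "unknown" row and column of the confusion matrix tells you directly whether your negative examples are enough for the model to separate that class from the real ones.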