I thought we might be able to compile a Caffeinated description of some methods of performing multiple category classification.
By multi-category classification I mean that the input data contains representations of multiple model output categories and/or is simply classifiable under multiple model output categories.
E.g. for an image containing a cat and a dog, the network would (ideally) output ~1 for both the cat and dog prediction categories and ~0 for all others.
Based on this paper, this stale and closed PR and this open PR, it seems caffe is perfectly capable of accepting labels. Is this correct?
Would the construction of such a network require the use of multiple neuron (inner product -> relu -> inner product) and softmax layers as in page 13 of this paper; or does Caffe's ip & softmax presently support multiple label dimensions?
When I'm passing my labels to the network which example would illustrate the correct approach (if not both)?:
E.g. Cat eating apple
Note: Python syntax, but I use the C++ source.
Column 0 - Class is in input; Column 1 - Class is not in input
[[1,0],  # Apple
 [0,1],  # Baseball
 [1,0],  # Cat
 [0,1]]  # Dog
or
Column 0 - Class is in input
[[1],  # Apple
 [0],  # Baseball
 [1],  # Cat
 [0]]  # Dog
If anything lacks clarity please let me know and I will generate pictorial examples of the questions I'm trying to ask.
Nice question. I believe there is no single "canonical" answer here and you may find several different approaches to tackle this problem. I'll do my best to show one possible way. It is slightly different than the question you asked, so I'll re-state the problem and suggest a solution.
The problem: given an input image and a set of C classes, indicate for each class whether it is depicted in the image or not.
Inputs: at training time, the inputs are pairs of an image and a C-dim binary vector indicating, for each of the C classes, whether it is present in the image or not.
Output: given an image, output a C-dim binary vector (same as the second form suggested in your question).
Making caffe do the job: In order to make this work we need to modify the top layers of the net using a different loss.
But first, let's understand the usual way caffe is used and then look into the changes needed.
The way things are now: the image is fed into the net, goes through conv/pooling/... layers and finally goes through an "InnerProduct" layer with C outputs. These C predictions go into a "Softmax" layer that inhibits all but the most dominant class. Once a single class is highlighted, the "SoftmaxWithLoss" layer checks that the highlighted predicted class matches the ground truth class.
What you need: the problem with the existing approach is the "Softmax" layer, which basically selects a single class. I suggest you replace it with a "Sigmoid" layer that maps each of the C outputs into an indicator of whether this specific class is present in the image. For training, you should use "SigmoidCrossEntropyLoss" instead of the "SoftmaxWithLoss" layer.
Since one image can have multiple labels, the most intuitive way is to think of this problem as C independent binary classification problems, where C is the total number of different classes. So it is easy to understand what @Shai has said:
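To make the "C independent binary classifiers" view concrete, here is a minimal NumPy sketch (not Caffe code) of what changes when the softmax is swapped for per-class sigmoids. The class list, the helper names and the score values are all made up for illustration. The loss is computed from the raw scores in a numerically stable form; Caffe's "SigmoidCrossEntropyLoss" likewise takes the raw "InnerProduct" outputs rather than sigmoid outputs.

```python
import numpy as np

# Hypothetical class list, matching the example in the question.
CLASSES = ["apple", "baseball", "cat", "dog"]

def encode_labels(present, classes=CLASSES):
    """Return a C-dim binary vector: 1 if the class is in the image, else 0."""
    return np.array([1.0 if c in present else 0.0 for c in classes])

def softmax(scores):
    """Softmax over classes: the outputs compete and always sum to 1."""
    exp = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp / exp.sum()

def sigmoid(scores):
    """Element-wise sigmoid: each class gets its own independent probability."""
    return 1.0 / (1.0 + np.exp(-scores))

def sigmoid_cross_entropy(scores, targets):
    """Sum of C binary cross-entropy terms, computed from raw scores.

    Uses the stable identity max(x, 0) - x*t + log(1 + exp(-|x|)).
    """
    return np.sum(np.maximum(scores, 0.0) - scores * targets
                  + np.log1p(np.exp(-np.abs(scores))))

# "Cat eating apple": two classes are present at once.
targets = encode_labels({"cat", "apple"})

# Made-up raw scores (the C outputs of the last InnerProduct-style layer).
scores = np.array([4.0, -4.0, 4.0, -4.0])

softmax_out = softmax(scores)   # apple and cat must share the probability mass
sigmoid_out = sigmoid(scores)   # apple and cat can both be close to 1
loss = sigmoid_cross_entropy(scores, targets)

print("targets:", targets)
print("softmax:", np.round(softmax_out, 3))
print("sigmoid:", np.round(sigmoid_out, 3))
print("loss   :", loss)
```

With softmax, two equally strong classes can never both approach 1 because the outputs are forced to sum to 1; with per-class sigmoids each output is judged on its own, which is what the single-column label format from the question calls for.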