Questions about the loss function in YOLOv2

Posted 2019-08-29 23:59

Question:

I read the YOLOv2 implementation and have some questions about its loss. Below is pseudocode for the loss function; I hope I got it right.

costs = np.zeros(output.shape)
# Pass 1: default cost for every prediction box
for pred_box in all prediction boxes:
    if (max IoU of pred_box with all truth boxes) < threshold:
        costs[pred_box][obj] = (sigmoid(obj) - 0)^2 * 1
    else:
        costs[pred_box][obj] = 0
    costs[pred_box][x] = (sigmoid(x) - 0.5)^2 * 0.01
    costs[pred_box][y] = (sigmoid(y) - 0.5)^2 * 0.01
    costs[pred_box][w] = (w - 0)^2 * 0.01
    costs[pred_box][h] = (h - 0)^2 * 0.01
# Pass 2: overwrite the cost of the box responsible for each ground truth
for truth_box in all ground truth boxes:
    pred_box = the one prediction box that is supposed to predict truth_box
    costs[pred_box][obj] = (1 - sigmoid(obj))^2 * 5
    costs[pred_box][x] = (sigmoid(x) - truex)^2 * (2 - truew*trueh/(imagew*imageh))
    costs[pred_box][y] = (sigmoid(y) - truey)^2 * (2 - truew*trueh/(imagew*imageh))
    costs[pred_box][w] = (w - log(truew/anchorw))^2 * (2 - truew*trueh/(imagew*imageh))
    costs[pred_box][h] = (h - log(trueh/anchorh))^2 * (2 - truew*trueh/(imagew*imageh))
    costs[pred_box][classes] = softmax_euclidean
total_loss = sum(costs)
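For reference, here is a minimal Python sketch of the terms for a single box that is assigned a ground truth, following the pseudocode above rather than any particular implementation. The function names `coord_scale` and `assigned_box_cost` are hypothetical, and the sketch assumes that the true width and height, the anchor, and the image size are all expressed in the same units.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def coord_scale(true_w, true_h, image_w, image_h):
    # Weight is close to 2 for tiny boxes and close to 1 for image-sized boxes.
    return 2.0 - (true_w * true_h) / (image_w * image_h)

def assigned_box_cost(raw, truth, anchor, image_size):
    """Cost terms for one prediction box that is assigned a ground truth.

    raw:        raw network outputs with keys 'x', 'y', 'w', 'h', 'obj'
    truth:      'x', 'y' are offsets inside the cell in [0, 1];
                'w', 'h' use the same units as anchor and image_size (assumption)
    anchor:     (anchor_w, anchor_h)
    image_size: (image_w, image_h)
    """
    scale = coord_scale(truth["w"], truth["h"], image_size[0], image_size[1])
    return {
        "obj": (1.0 - sigmoid(raw["obj"])) ** 2 * 5.0,
        "x": (sigmoid(raw["x"]) - truth["x"]) ** 2 * scale,
        "y": (sigmoid(raw["y"]) - truth["y"]) ** 2 * scale,
        "w": (raw["w"] - math.log(truth["w"] / anchor[0])) ** 2 * scale,
        "h": (raw["h"] - math.log(truth["h"] / anchor[1])) ** 2 * scale,
    }
```

With all raw outputs at zero and a truth box that is centered in its cell and matches the anchor exactly, the coordinate terms vanish and only the objectness term remains.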

I have some questions about this:

1. The code randomly resizes the training images to dimensions between 320 and 608 every 10 batches, but the anchor boxes are not resized accordingly. Why not resize the anchors too? The anchors were selected as the most common box shapes on a 13*13 feature map, and those shapes won't be the most common ones on a 19*19 feature map, so why not scale the anchors with the image size?

2. A cost is applied to the x, y, w, h predictions of boxes that are not assigned a truth, which by default pushes w and h to fit the anchor exactly and x and y to the center of the cell. Is that helpful, and if so, why? Why not apply the location cost only to the boxes assigned a truth and ignore the unassigned ones?

3. Why not simply apply (obj-0)^2 as the objectness cost for all boxes with no truth assigned? In YOLOv2, not every box without an assigned truth receives an objectness cost: only those that have no truth assigned and also do not overlap much with any truth box. Why this more complicated rule?
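To make the rule in question 3 concrete, here is a minimal Python sketch of the no-object selection as written in the pseudocode. The helper names `iou` and `noobj_mask` are hypothetical, boxes are assumed to be `(cx, cy, w, h)` tuples in a shared coordinate system, and the default threshold of 0.6 is an assumed value rather than one stated in the question.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def noobj_mask(pred_boxes, truth_boxes, threshold=0.6):
    """True for predictions whose best IoU with any truth is below threshold.

    Only these predictions receive the (sigmoid(obj) - 0)^2 cost.
    """
    return [
        max((iou(p, t) for t in truth_boxes), default=0.0) < threshold
        for p in pred_boxes
    ]
```

A prediction that overlaps a truth box well but is not the one assigned to it is therefore pushed toward neither 0 nor 1.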

Answer 1:

1

In the implementation of YOLOv2, random cropping is used to augment the training data. Random cropping crops a part of the image and expands it so that it has the same size as the original.

This augmentation makes the trained network robust to object sizes that it has not seen in the training data, so the anchor boxes should not be changed during this process.

Remember that anchor boxes are assumptions about the shape of objects that are fixed before training and prediction. If the network relies on an assumption like this, it becomes less robust to objects whose shapes differ greatly from it. Data augmentation addresses this problem.
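As a rough illustration of what keeping the anchors fixed means, the sketch below assumes that anchors are stored in grid-cell units and that the network downsamples its input by a factor of 32. The function `anchor_footprint` and the anchor width of 1.5 cells are hypothetical.

```python
STRIDE = 32  # assumed downsampling factor between input image and feature map

def anchor_footprint(anchor_cells, input_size, stride=STRIDE):
    """Return (grid_size, size_in_pixels, fraction_of_image) for one anchor side.

    anchor_cells is the anchor width or height in grid-cell units (assumption).
    """
    grid_size = input_size // stride
    pixels = anchor_cells * stride
    fraction = anchor_cells / grid_size
    return grid_size, pixels, fraction

for input_size in (320, 416, 608):
    print(input_size, anchor_footprint(1.5, input_size))
```

Under these assumptions the anchor keeps the same size in pixels at every input resolution, while the fraction of the image it covers shrinks as the input grows.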

2

This is because we don't know the truth for the center coordinates and the box shape. When we train YOLO, we use the concept of responsible boxes: these are the boxes that are updated during training.

Please see the section "‘Responsible’ Bounding Boxes" of my Medium post.
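One common way to choose the responsible box can be sketched as follows; this is an assumption for illustration and not necessarily the procedure described in the post. The truth box's center selects the grid cell, and the anchor whose shape best overlaps the truth selects the box within that cell. The names `shape_iou` and `responsible_box` are hypothetical.

```python
def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same center, so only shapes matter."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def responsible_box(truth, anchors, grid_size):
    """Return (row, col, anchor_index) of the box responsible for a truth.

    truth:   (cx, cy, w, h), with cx and cy in [0, 1) relative to the image
             and w and h in grid-cell units (assumption)
    anchors: list of (w, h) in grid-cell units
    """
    col = int(truth[0] * grid_size)
    row = int(truth[1] * grid_size)
    best = max(
        range(len(anchors)),
        key=lambda k: shape_iou(truth[2], truth[3], anchors[k][0], anchors[k][1]),
    )
    return row, col, best
```

For example, a truth box centered in a 13*13 grid with shape (2, 4) and anchors (1, 1), (2, 4), (8, 8) is assigned to cell (6, 6) and the second anchor.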

3

This is because the output of YOLO comes directly from a convolutional layer, not from the activation of a fully connected layer. The output is therefore not restricted to the range between 0 and 1, so we apply a sigmoid function to make it represent a probability.
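A small sketch of that squashing step, with made-up raw values standing in for the convolutional output:

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw objectness outputs of the last convolutional layer.
raw_outputs = [-4.0, 0.0, 4.0]
objectness = [sigmoid(z) for z in raw_outputs]
```

Large negative raw values map close to 0, large positive values map close to 1, and a raw value of 0 maps to exactly 0.5.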