In computer vision and object detection, the common evaluation method is mAP. What is it and how is it calculated?
Question:
Answer 1:
Quotes are from the above-mentioned Zisserman paper, Section 4.2, "Evaluation of Results" (page 11):
First, an "overlap criterion" is defined as an intersection-over-union greater than 0.5 (i.e. if a predicted box satisfies this criterion with respect to a ground-truth box, it is considered a detection). Then the GT boxes are matched to the predicted boxes using this "greedy" approach:
Detections output by a method were assigned to ground truth objects satisfying the overlap criterion in order ranked by the (decreasing) confidence output. Multiple detections of the same object in an image were considered false detections e.g. 5 detections of a single object counted as 1 correct detection and 4 false detections
Hence each predicted box is either a True Positive or a False Positive. Each ground-truth box is either matched by a True Positive or missed (a False Negative). There are no True Negatives.
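A minimal sketch of the greedy matching described above, for the boxes of a single class in a single image. The function names, the `(x1, y1, x2, y2)` box format, and the choice to compare each detection against its best-overlapping GT box are assumptions made for illustration, not something the paper's quote spells out.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_threshold=0.5):
    """Label detections as TP (True) or FP (False), highest confidence first.

    detections: list of (confidence, box); gt_boxes: list of boxes.
    The returned labels are in decreasing-confidence order.
    """
    claimed = [False] * len(gt_boxes)  # which GT boxes are already matched
    labels = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Assumption: compare against the GT box with the largest overlap.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_iou > iou_threshold and not claimed[best_idx]:
            claimed[best_idx] = True
            labels.append(True)   # first sufficiently overlapping detection
        else:
            labels.append(False)  # duplicate detection or too little overlap
    return labels
```

With one GT box and three detections (a perfect hit, a slightly shifted duplicate, and a box far away), only the highest-confidence hit counts as a true positive; the duplicate and the stray box are false positives.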
Then the average precision is computed by averaging the precision values on the precision-recall curve at the recall levels 0, 0.1, ..., 1 (i.e. the average of 11 precision values). To be more precise, we consider a slightly corrected PR curve, where for each curve point (p, r), if there is a different curve point (p', r') such that p' > p and r' >= r, we replace p with the maximum p' of those points.
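The 11-point averaging can be sketched as follows. The function name and input format (TP/FP flags already sorted by decreasing confidence, plus the number of GT boxes) are assumptions for illustration; treating a recall level that is never reached as precision 0 is also an assumption of this sketch.

```python
def eleven_point_ap(labels, num_gt):
    """11-point interpolated AP.

    labels: TP (True) / FP (False) flags sorted by decreasing confidence.
    num_gt: total number of ground-truth boxes for this class.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, is_tp in enumerate(labels, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)   # precision after `rank` detections
        recalls.append(tp / num_gt)    # recall after `rank` detections

    total = 0.0
    for i in range(11):
        level = i / 10
        # Corrected curve: best precision among points with recall >= level.
        # Assumption: if no point reaches this recall, use precision 0.
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / 11
```

For example, with flags `[True, False, True]` and 2 GT boxes, the six recall levels up to 0.5 get precision 1.0 and the five levels above get 2/3, so the AP is (6 + 10/3) / 11, roughly 0.85.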
What is still unclear to me is what is done with those GT boxes that are never detected (even when the confidence threshold is 0). This means that there are certain recall values that the precision-recall curve will never reach, and this makes the average precision computation above undefined.
Edit:
Short answer: in the region where the recall is unreachable, the precision drops to 0.
One way to explain this is to assume that when the threshold for the confidence approaches 0, an infinite number of predicted bounding boxes light up all over the image. The precision then immediately goes to 0 (since there is only a finite number of GT boxes) and the recall keeps growing on this flat curve until we reach 100%.
Answer 2:
mAP is Mean Average Precision.
Its use differs between Information Retrieval (references [1], [2]) and multi-class classification (object detection) settings.
To calculate it for object detection, you calculate the average precision for each class in your data based on your model predictions. Average precision is related to the area under the precision-recall curve for a class. Taking the mean of these per-class average precisions then gives you the Mean Average Precision.
To calculate Average Precision, see [3].
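As a small illustration of the last step, here is a sketch that assumes the per-class AP values have already been computed; the function name and the class names in the usage note are made up for the example.

```python
from statistics import mean


def mean_average_precision(ap_per_class):
    """ap_per_class: dict mapping class name -> average precision in [0, 1]."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    # Every class contributes equally, regardless of how many objects it has.
    return mean(ap_per_class.values())
```

For instance, `mean_average_precision({"cat": 0.8, "dog": 0.6, "car": 0.7})` is about 0.7. Note that the mean is taken over classes, not over individual detections.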
Answer 3:
For detection, a common way to determine if one object proposal was right is Intersection over Union (IoU, IU). This takes the set $A$ of proposed object pixels and the set $B$ of true object pixels and calculates:
$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$
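A minimal sketch of this set-based IoU, assuming each object is given as a Python set of `(row, col)` pixel coordinates (the function name and representation are illustrative choices):

```python
def pixel_iou(proposed, truth):
    """IoU of two pixel sets, each a set of (row, col) tuples."""
    union = proposed | truth
    if not union:
        return 0.0  # assumption: two empty sets count as no overlap
    return len(proposed & truth) / len(union)
```

Two 2x2 pixel squares that share one column of two pixels have an intersection of 2 pixels and a union of 6, so their IoU is 1/3.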
Commonly, IoU > 0.5 means that it was a hit, otherwise it was a fail. For each class, one can calculate the
- True Positive TP(c): a proposal was made for class c and there actually was an object of class c
- False Positive FP(c): a proposal was made for class c, but there is no object of class c
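- Average Precision for class c: $AP(c) = \frac{\#TP(c)}{\#TP(c) + \#FP(c)}$ (strictly speaking, this is the precision at one fixed score threshold; the full AP averages precision over recall levels)
The mAP (mean average precision) is then: $mAP = \frac{1}{|C|} \sum_{c \in C} AP(c)$, where $C$ is the set of classes.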
Note: If one wants better proposals, one increases the IoU threshold from 0.5 to a higher value (up to 1.0, which would be perfect). One can denote this with mAP@p, where $p \in (0, 1)$ is the IoU threshold.
mAP@[.5:.95] means that the mAP is calculated at multiple IoU thresholds (0.5 to 0.95 in steps of 0.05) and then averaged over those thresholds.
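The averaging over thresholds can be sketched like this. The callable `map_at_iou` is a hypothetical stand-in for whatever routine computes the mAP at one IoU threshold; it is not a function from any particular library.

```python
def map_over_iou_range(map_at_iou):
    """Average mAP over the IoU thresholds 0.50, 0.55, ..., 0.95.

    map_at_iou: hypothetical callable, IoU threshold -> mAP at that threshold.
    """
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(map_at_iou(t) for t in thresholds) / len(thresholds)
```

Because stricter thresholds turn more detections into false positives, the averaged value is usually noticeably lower than the mAP at 0.5 alone.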
Edit: For more detailed information, see the COCO evaluation metrics.
Answer 4:
I think the important part here is linking how object detection can be considered the same as the standard information retrieval problems for which there exists at least one excellent description of average precision.
The output of some object detection algorithm is a set of proposed bounding boxes, and for each one, a confidence and classification scores (one score per class). Let's ignore the classification scores for now, and use the confidence as the input to a thresholded binary classification. Intuitively, the average precision is an aggregation over all choices for the threshold/cut-off value. But wait: in order to calculate precision, we need to know if a box is correct!
This is where it gets confusing/difficult; as opposed to typical information retrieval problems, we actually have an extra level of classification here. That is, we can't do exact matching between boxes, so we need to classify if a bounding box is correct or not. The solution is to essentially do a hard-coded classification on the box dimensions; we check if it sufficiently overlaps with any ground truth to be considered 'correct'. The threshold for this part is chosen by common sense. The dataset you are working on will likely define what this threshold for a 'correct' bounding box is. Most datasets just set it at 0.5 IoU and leave it at that (I recommend doing a few manual IoU calculations [they're not hard] to get a feel for how strict IoU of 0.5 actually is).
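To get that feel quickly, here is a small sketch that shifts a 100x100 box diagonally and prints the resulting IoU. The `(x1, y1, x2, y2)` box format and the function name are assumptions for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


reference = (0, 0, 100, 100)
for shift in (10, 20, 30):
    # Same size box, moved `shift` pixels right and down.
    shifted = (shift, shift, 100 + shift, 100 + shift)
    print(shift, round(box_iou(reference, shifted), 3))
```

A diagonal shift of 10 pixels still gives an IoU of about 0.68, but a shift of 20 pixels (intersection 6400, union 13600) already drops to about 0.47, which fails the 0.5 criterion even though the boxes look similar.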
Now that we have actually defined what it means to be 'correct', we can just use the same process as information retrieval.
To find mean average precision (mAP), you just stratify your proposed boxes based on the maximum of the classification scores associated with those boxes, then average (take the mean of) the average precision (AP) over the classes.
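The stratification step might look like the sketch below. The dictionary layout of a box (`"confidence"` and `"class_scores"` keys) is a hypothetical format chosen for the example.

```python
from collections import defaultdict


def group_by_predicted_class(boxes):
    """Group boxes by the class with the highest classification score.

    boxes: list of dicts with keys "confidence" (float) and
    "class_scores" (dict: class name -> score). Hypothetical format.
    """
    groups = defaultdict(list)
    for box in boxes:
        scores = box["class_scores"]
        label = max(scores, key=scores.get)  # class with the maximum score
        groups[label].append(box)
    return dict(groups)
```

Each group is then ranked by box confidence and evaluated on its own to get that class's AP, and the per-class APs are averaged to get the mAP.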
TL;DR: make the distinction between determining if a bounding box prediction is 'correct' (the extra level of classification) and evaluating how well the box confidence informs you of a 'correct' bounding box prediction (completely analogous to the information retrieval case), and the typical descriptions of mAP will make sense.
It's worth noting that Area under the Precision/Recall curve is the same thing as average precision, and we are essentially approximating this area with the trapezoidal or right-hand rule for approximating integrals.
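As a rough sketch of the right-hand rule version: walk down the ranked detections and, each time recall increases, add a rectangle whose width is the recall step and whose height is the precision at that point. The function name and input format (TP/FP flags sorted by decreasing confidence) are assumptions for illustration.

```python
def ap_right_hand_rule(labels, num_gt):
    """Area under the PR curve using the right-hand rule.

    labels: TP (True) / FP (False) flags sorted by decreasing confidence.
    num_gt: total number of ground-truth boxes for this class.
    """
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, is_tp in enumerate(labels, start=1):
        tp += int(is_tp)
        recall = tp / num_gt
        precision = tp / rank
        # Rectangle: recall step (width) times precision at its right edge.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

With flags `[True, False, True]` and 2 GT boxes this gives 0.5 * 1 + 0.5 * 2/3 = 5/6. False positives add no area by themselves (recall does not move); they only lower the precision of later true positives.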