In computer vision and object detection, the common evaluation method is mAP. What is it and how is it calculated?
mAP is Mean Average Precision.
It is used differently in Information Retrieval (references [1], [2]) and in multi-class classification (object detection) settings.
To calculate it for object detection, you compute the average precision for each class in your data based on your model's predictions. The average precision of a class is related to the area under that class's precision-recall curve. Taking the mean of these per-class average precisions gives you the Mean Average Precision.
To calculate Average Precision, see [3]
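As a minimal sketch of that last averaging step, assuming the per-class AP values have already been computed: the function name `mean_average_precision` and the class names and numbers below are made up purely for illustration.

```python
def mean_average_precision(ap_per_class):
    """Mean of per-class average precision values.

    ap_per_class: dict mapping a class name to its AP in [0, 1].
    """
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Illustrative (made-up) AP values, not results from any real model.
ap_per_class = {"car": 0.80, "person": 0.60, "dog": 0.70}
print(mean_average_precision(ap_per_class))
```

Every class counts equally here, regardless of how many ground-truth instances it has.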
Commonly, IoU > 0.5 means that it was a hit, otherwise it was a fail. For each class, one can calculate the
Note: If one wants better proposals, one increases the IoU threshold from 0.5 to a higher value (up to 1.0, which would be perfect). One can denote this with mAP@p, where p \in (0, 1) is the IoU threshold.
mAP@[.5:.95] means that the mAP is calculated at multiple IoU thresholds and then averaged again.
Edit: For more detailed information, see the COCO evaluation metrics.
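A rough sketch of that second averaging, assuming the COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The callable `map_at_iou` is a hypothetical stand-in for whatever routine computes mAP at a single threshold; it is not part of any library.

```python
def coco_style_map(map_at_iou):
    """Average mAP over the IoU thresholds 0.50, 0.55, ..., 0.95.

    map_at_iou: hypothetical callable taking an IoU threshold and
    returning the mAP computed at that threshold.
    """
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(map_at_iou(t) for t in thresholds) / len(thresholds)


# Toy stand-in: pretend mAP falls linearly as the threshold gets stricter.
print(coco_style_map(lambda t: 1.0 - t))
```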
I think the important part here is linking how object detection can be considered the same as the standard information retrieval problems for which there exists at least one excellent description of average precision.
The output of an object detection algorithm is a set of proposed bounding boxes and, for each one, a confidence and a set of classification scores (one score per class). Let's ignore the classification scores for now and use the confidence as the input to a thresholded binary classification. Intuitively, the average precision is an aggregation over all choices of the threshold/cut-off value. But wait: in order to calculate precision, we need to know whether a box is correct!
This is where it gets confusing/difficult; as opposed to typical information retrieval problems, we actually have an extra level of classification here. That is, we can't do exact matching between boxes, so we need to classify if a bounding box is correct or not. The solution is to essentially do a hard-coded classification on the box dimensions; we check if it sufficiently overlaps with any ground truth to be considered 'correct'. The threshold for this part is chosen by common sense. The dataset you are working on will likely define what this threshold for a 'correct' bounding box is. Most datasets just set it at 0.5 IoU and leave it at that (I recommend doing a few manual IoU calculations [they're not hard] to get a feel for how strict IoU of 0.5 actually is).
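To get a feel for how strict 0.5 is, here is a minimal IoU sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the box format and the function name `iou` are my own choices for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A 10x10 box shifted by 2 units in both x and y:
# intersection 8*8 = 64, union 100 + 100 - 64 = 136, IoU = 64/136 (about 0.47).
print(iou((0, 0, 10, 10), (2, 2, 12, 12)))
```

So a shift of only 20% of the box size along each axis is already enough to fail the 0.5 criterion.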
Now that we have actually defined what it means to be 'correct', we can just use the same process as information retrieval.
To find mean average precision (mAP), you just stratify your proposed boxes based on the maximum of the classification scores associated with those boxes, then take the mean of the average precision (AP) over the classes.
TL;DR: make the distinction between determining if a bounding box prediction is 'correct' (the extra level of classification) and evaluating how well the box confidence informs you of a 'correct' bounding box prediction (completely analogous to the information retrieval case), and the typical descriptions of mAP will make sense.
It's worth noting that Area under the Precision/Recall curve is the same thing as average precision, and we are essentially approximating this area with the trapezoidal or right-hand rule for approximating integrals.
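A small sketch of the right-hand-rule version, assuming each proposed box of one class has already been labelled correct or not. The function name `average_precision` and the toy inputs are hypothetical.

```python
def average_precision(scores, is_correct, num_gt):
    """Area under the precision-recall curve using the right-hand rule.

    scores:     confidence of each proposed box
    is_correct: whether each box counts as a correct detection
    num_gt:     number of ground-truth boxes for this class
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    # Walk down the ranking, i.e. lower the confidence threshold step by step.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    area = 0.0
    previous_recall = 0.0
    for rank, i in enumerate(order, start=1):
        if is_correct[i]:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_gt
        # Rectangle whose height is the precision at the right edge.
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Ranking: correct, wrong, correct, wrong, with 2 ground-truth boxes.
# AP = 1.0 * 0.5 + (2/3) * 0.5, roughly 0.83.
print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2))
```

Only the ranks at which recall increases contribute, so this is the same as averaging the precision at each correct detection and dividing by the number of ground-truth boxes.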
Quotes are from the above mentioned Zisserman paper - 4.2 Evaluation of Results (Page 11):
First an "overlap criterion" is defined as an intersection-over-union greater than 0.5 (i.e. if a predicted box satisfies this criterion with respect to a ground-truth box, it is considered a detection). Then a matching is made between the GT boxes and the predicted boxes using this "greedy" approach:
Hence each predicted box is either a True-Positive or a False-Positive. Each ground-truth box is either matched (a True-Positive) or missed (a False-Negative). There are no True-Negatives.
Then the average precision is computed by averaging the precision values on the precision-recall curve at the recall levels [0, 0.1, ..., 1] (i.e. the average of 11 precision values). To be more precise, we consider a slightly corrected PR curve, where for each curve point (p, r), if there is a different curve point (p', r') such that p' > p and r' >= r, we replace p with the maximum p' of those points.
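One common way to do such a greedy matching is sketched below. This is my assumption of a PASCAL VOC-style procedure, not a quote from the paper: predictions are visited in order of decreasing confidence, each is compared with the ground-truth box it overlaps most, and a ground-truth box can be claimed only once, so duplicates become false positives. The names `iou` and `greedy_match` and the `(x1, y1, x2, y2)` box format are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(preds, gts, iou_thresh=0.5):
    """Label predictions of one class in one image as TP or FP.

    preds: list of (confidence, box); gts: list of ground-truth boxes.
    Returns ([(confidence, is_true_positive), ...], number_of_missed_gt).
    """
    matched = [False] * len(gts)
    labels = []
    for conf, box in sorted(preds, key=lambda p: p[0], reverse=True):
        # Find the ground-truth box with the largest overlap.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_iou > iou_thresh and not matched[best_j]:
            matched[best_j] = True          # first claim: true positive
            labels.append((conf, True))
        else:
            labels.append((conf, False))    # poor overlap or duplicate
    return labels, matched.count(False)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.3, (50, 50, 60, 60))]
print(greedy_match(preds, gts))
# ([(0.9, True), (0.8, False), (0.3, False)], 1)
```

In the example, the second prediction overlaps the first ground-truth box well but arrives after it has been claimed, and the second ground-truth box is never detected.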
What is still unclear to me is what is done with those GT boxes that are never detected (even when the confidence threshold is 0). This means that there are certain recall values that the precision-recall curve will never reach, and this makes the average precision computation above undefined.
Edit:
Short answer: in the region where the recall is unreachable, the precision drops to 0.
One way to explain this is to assume that when the threshold for the confidence approaches 0, an infinite number of predicted bounding boxes light up all over the image. The precision then immediately goes to 0 (since there is only a finite number of GT boxes) and the recall keeps growing on this flat curve until we reach 100%.
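The 11-point average with the corrected (interpolated) curve, including the convention that unreachable recall levels contribute a precision of 0, can be sketched as follows. The function name `eleven_point_ap` and the input format (a list of `(recall, precision)` curve points) are assumptions for illustration.

```python
def eleven_point_ap(pr_points):
    """11-point interpolated average precision.

    pr_points: list of (recall, precision) points on the PR curve.
    """
    total = 0.0
    for k in range(11):
        level = k / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= level.
        # The small tolerance guards against floating-point noise.
        reachable = [p for r, p in pr_points if r >= level - 1e-12]
        # If no point reaches this recall level, precision counts as 0.
        total += max(reachable) if reachable else 0.0
    return total / 11


# Curve that reaches full recall: 6 levels at 1.0, 5 levels at 2/3.
print(eleven_point_ap([(0.5, 1.0), (0.5, 0.5), (1.0, 2 / 3), (1.0, 0.5)]))

# Curve that never gets past recall 0.5: the 5 unreachable levels add 0,
# giving 6/11.
print(eleven_point_ap([(0.5, 1.0)]))
```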