I'm trying to compute the Average Precision (and Mean Average Precision) on the Oxford Building image dataset.
Below there is the code that they provide for computing Average Precision. Notice that pos_set
is the union of the "optimal" and "good" images from the ground trouth set, while junk_set
is a set of not-relevant images.
void OxfordTest::computeAp(std::vector<std::string> &ranked_list){
float old_recall = 0.0;
float old_precision = 1.0;
float ap = 0.0;
size_t intersect_size = 0;
size_t i = 0;
size_t j = 0;
for ( ; i<ranked_list.size(); ++i) {
if(!pos_set.count(ranked_list[i]))
std::cin.get();
}
if (junk_set.count(ranked_list[i])) continue;
if (pos_set.count(ranked_list[i])) intersect_size++;
float recall = intersect_size / (float)pos_set.size();
float precision = intersect_size / (j + 1.0);
ap += (recall - old_recall)*((old_precision + precision)/2.0);
old_recall = recall;
old_precision = precision;
j++;
}
}
Which is something totally different from the notion given on the linked Wikipedia page. What is the correlation between these notions?
I'm more than sure that Wikipedia's notion is correct, since it corresponds with the one given in this answer and this article.
I don't understand why in the code above it is reported:
- The recall, while the Wikipedia's notion include only precision in the last formula.
- Even considering the formula with the delta recall, nobody talks about `(old_precision + precision) /2
This is the C++ original code.
Junk set
The original paper states:
(3) Junk – less than 25% of the object
is visible, or there is a very high level of occlusion or distortion.
(4) Absent – the object is not present
I.e. junk images are not negatives. There are positives (OK+Good), ignores (Junk) and negatives (Absent). Note that all these are per-query, i.e. some images are junk for query 1 but not for query 15. If you look at the images that are 'junk' you'll see ambiguous examples, e.g. some cases have extreme zoom or blurring which will make you think if this image contains the queried landmark or not, and cases where only a tiny part of the object is visible so the image is too hard.
In computing the average precision, we use the Good and
Ok images as positive examples of the landmark in question,
Absent images as negative examples and Junk images
as null examples. These null examples are treated as though
they are not present in the database – our score is unaffected
whether they are returned or not.
So the authors defined the junk set to be neither positives nor negatives - the images most likely depict the queried object, but for some of them we are not sure, or it would be too harsh to treat them as positives and ask the system to retrieve these examples (and therefore penalise if it doesn't). At the same time, it would also be harsh to treat them as negatives as if the system does retrieve them, it shouldn't be penalised. So all that needs to be done is that (on a per-query basis) you ignore the junks and treat them as if they don't exist. So you take the retrieved list, filter out all junk images for this query, then run normal AP computation on this filtered list. That's what the code is doing effectively - when the example is in amb(=junk), it is just skipped. Then if the example is not in amb, if it's in pos(itives) the intersect_size (current number of positives up until position i) is incremented. The quantity j (well, j-1) is the number of non-skipped elements in the list (it gets incremented only if the current element is not junk).
AP computation
You certainly need the recall in your AP computation, as explained by shiri in the previous answer, and as described in your article, p(r) is the precision at a particular recall. The best way to think of AP is not to examine a random formula but to understand what is the intuition and then see how the formula captures it, i.e. what wikipedia says at the start: you can plot precision as a function of recall, and AP is then simply the area under the curve. You want the precision to be high at all recalls, so the ideal curve is p(r)=1 which would maximise the AP.
So what is the code doing? It's computing the area under the precision-recall curve using the trapezoidal rule, see this equation on Wikipedia and you'll see it's identical to the code. The AP computation for the discrete case from your Wikipedia article is a (commonly used) worse approximation to the area under the precision-recall curve, the rectangle method.
Recall is definitely relevant for Average Precision, since you are effectively calculating the precision at every possible recall point. You can see this reflected in the first Wikipedia definitions, as you noticed yourself.
A good overview with a clear explanation of AP can also be found here:
https://sanchom.wordpress.com/tag/average-precision/
I'll start with the assumption that this code snippet is correctly calculating AP, and let's see where that leads us. (This is not necessarily true, but given that the paper in question has been cited 1.8K times since 2007, presumably if there was an error someone would've caught it by now.)
Each element contributing to the sum of the AP is defined by Wikipedia as:
P(k) * delta_r(k)
where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and delta_r(k) is the change in recall from items k-1 to k.
In other words, this line...
ap += (recall - old_recall)*((old_precision + precision)/2.0);
... is presumably what is adding the sum elements.
It's clear that delta_r(k)==(recall - old_recall)
, so that part is covered.
Now, what about ((old_precision + precision)/2.0)
? This was also what had you concerned.
OK. So. This part is indeed weird. Instead of using P(k) (precision at cutoff k), it's apparently averaging P(k) and P(k-1). I ran this by my labmates (I work at a nationally recognized IR lab), and we couldn't figure out why the code would do that. My hunch is that it's some form of smoothing that the authors chose to do but I can't see why. The other alternative is that the sum is somehow telescoping and that these items cancel each other out. It certainly looks strange.
Edit: This "weird" rule apparently draws from using the trapeziodal rule instead of the rectangle rule for estimating the area under the curve, as explained by Relja Arandjelović in the accepted answer. Adding here for completeness. <\edit>
Meanwhile, you can cross-reference your results from this ranking function against trec_eval and see if you get the same results or different ones.