AP is not invariant to shuffling the order of detections #650

Open
deanmark opened this issue Jul 12, 2023 · 1 comment

Comments

@deanmark

I'm running the example in pycocoEvalDemo.ipynb. If I shuffle the order of the detections in the results file, then for certain shuffles I get different AP results.

Shuffling:

import json
import random

# load the original detection results, shuffle them, and save to a new file
anns = json.load(open(resFile))
random.shuffle(anns)
resFile2 = resFile.replace('results.json', 'results2.json')
json.dump(anns, open(resFile2, 'w'), separators=(',', ':'))

Now evaluate using the shuffled file by replacing:
cocoDt=cocoGt.loadRes(resFile)
with
cocoDt=cocoGt.loadRes(resFile2)
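
For reference, a minimal end-to-end sketch of the comparison (assuming annFile, imgIds, annType, resFile and resFile2 are already set up as in pycocoEvalDemo.ipynb; the variable names follow the demo notebook):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

cocoGt = COCO(annFile)                 # ground-truth annotations from the demo

for f in (resFile, resFile2):          # original vs. shuffled detections
    cocoDt = cocoGt.loadRes(f)
    cocoEval = COCOeval(cocoGt, cocoDt, annType)
    cocoEval.params.imgIds = imgIds    # same image subset as in the demo
    cocoEval.evaluate()
    cocoEval.accumulate()
    cocoEval.summarize()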

With the original detections file, I get the following results:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.50458
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.69697
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.57298
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.58563
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.51940
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.50140
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.38681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.59368
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.59535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.63981
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.56642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.56429

And after shuffling, I get:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.50458
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.69786
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.57293
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.58564
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.51940
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.50140
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.38600
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.59389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.59557
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.64012
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.56642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.56429

Notice that AP@0.50 changes from 0.69697 to 0.69786.
I'm using the same detections, only in a different order, yet the results are slightly different.

@deanmark
Author

After some analysis, the bug in the AP calculation seems to arise from the accumulate function: the detections are sorted by dtScores at line 366
inds = np.argsort(-dtScores, kind='mergesort')

The problem occurs when several detections have exactly the same score but different dtMatches values. Because mergesort is stable, the order in which tied detections appear after sorting is determined by their order in the original detections file. If some of these tied detections are matched and some are not, the final AP calculation depends on that order.
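
A standalone toy example (plain NumPy, not cocoeval itself) that illustrates the effect: with a stable sort, two orderings of the same tied detections yield different true-positive sequences and therefore different running precision:

import numpy as np

# two file orderings of the same pair of detections with equal scores
scores = np.array([0.9, 0.9])
matched_a = np.array([1, 0])   # matched detection listed first in the file
matched_b = np.array([0, 1])   # unmatched detection listed first in the file

for matched in (matched_a, matched_b):
    inds = np.argsort(-scores, kind='mergesort')   # stable: ties keep file order
    tp = matched[inds]
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    print(tp, precision)
# prints [1 0] [1.  0.5] for the first ordering and [0 1] [0.  0.5] for the second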

One way to solve the problem is to sort by dtScores and use dtMatches as a tie breaker, giving matched detections precedence in the sort. The AP would then be invariant to the input order of the detections. However, fixing this would change the current behavior - i.e. the newly reported scores might differ slightly from the scores some users see today.

A possible fix is to replace lines 362-366 in cocoeval.py with:

dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
# dtMatches is a T x D array (IoU thresholds x detections), so slice and
# concatenate along the detection axis
dtMatches = np.concatenate([e['dtMatches'][:, 0:maxDet] for e in E], axis=1)

# sort by descending score; among equal scores, matched detections
# (dtMatches != 0 at the lowest IoU threshold) come before unmatched ones,
# so the result no longer depends on the order of the input detections
inds = np.lexsort((np.logical_not(dtMatches[0]), -dtScores))
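
A quick sanity check of the tie-break on the same toy data (again outside cocoeval): with lexsort, the matched detection now sorts first regardless of file order:

import numpy as np

scores  = np.array([0.9, 0.9])
matched = np.array([0, 1])      # unmatched detection listed first in the file
# lexsort uses the last key (-scores) as the primary key and earlier keys as
# tie breakers, so matched (logical_not == False) sorts before unmatched
inds = np.lexsort((np.logical_not(matched), -scores))
print(matched[inds])            # [1 0] - matched first, independent of file order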
