This repository contains the experiments on MSCOCO classes described in Section 5 (Discussion) of the paper Unsupervised Hard Example Mining from Videos for Improved Object Detection.
We used the original version of py-faster-rcnn-ft to fine-tune a VGG16 network pretrained on ImageNet, converting it into a binary classifier for a single MSCOCO category. With this classifier as the backbone network of the Faster RCNN, we labelled every frame of a video for the presence of that category. From the labelled frames, our algorithm identified the frames containing hard negatives. Finally, we fine-tuned the network again with those frames included and evaluated the improvement on held-out validation and test sets.
For our research, we carried out experiments on two MSCOCO categories, Dog and Train.
Follow the steps mentioned in the py-faster-rcnn-ft repository to prepare a VGG16 Faster RCNN network trained on an MSCOCO category of your choice.
Scrape the web and download videos that are likely to contain many instances of your chosen category. Helper code to download YouTube videos can be found here. Once the videos have been downloaded, run the detections code to label each frame of every video with bounding boxes and confidence scores for that category. See Usage.
The videos we used are listed below:
The detections code outputs a txt file containing frame-wise labels and bounding box information. Run the hard negative mining code on the detections txt file to output the frames containing hard negatives, along with a txt file containing the bounding box information for those frames. See Usage.
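The idea behind the mining step can be sketched as follows. This is a minimal illustration and not the repository's code: it assumes a hypothetical detections format with one detection per line, `frame_index x1 y1 x2 y2 score`, and treats a confident detection as a "flicker" (a likely hard negative) when no detection overlaps it in the neighbouring frames. The helper names (`parse_detections`, `find_flickers`) and all thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict


def parse_detections(lines):
    """Parse 'frame x1 y1 x2 y2 score' lines (assumed format) into {frame: [(box, score)]}."""
    per_frame = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        frame = int(parts[0])
        x1, y1, x2, y2, score = map(float, parts[1:])
        per_frame[frame].append(((x1, y1, x2, y2), score))
    return per_frame


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def find_flickers(per_frame, score_thresh=0.8, iou_thresh=0.2, window=1):
    """Return (frame, box, score) for confident detections that have no
    overlapping detection in the `window` frames before or after.
    All thresholds are assumed values, not the ones used in the paper."""
    flickers = []
    for frame, dets in sorted(per_frame.items()):
        for box, score in dets:
            if score < score_thresh:
                continue
            # Collect every detection from the neighbouring frames.
            neighbours = [
                other_box
                for offset in range(-window, window + 1) if offset != 0
                for other_box, _ in per_frame.get(frame + offset, [])
            ]
            if all(iou(box, nb) < iou_thresh for nb in neighbours):
                flickers.append((frame, box, score))
    return flickers
```

Note that with this simple rule, detections in the first and last frames of a video have fewer neighbours to be matched against, so a real pipeline would likely treat the video boundaries separately.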
Use the COCO annotations editor located inside utils to include the frames containing hard negatives in the MSCOCO dataset. Once the frames have been included in the COCO dataset, fine-tune to get an improved network. See Usage.
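For orientation, adding frames to a COCO-style annotation file amounts to appending entries to its `images` and `annotations` lists. The sketch below is not the editor in utils; the function name `add_frames_to_coco`, the layout of the `frames` argument, and the choice of `category_id` are all assumptions. How the hard-negative boxes should be labelled depends on the training code, so the category id is left as a parameter.

```python
def add_frames_to_coco(coco, frames, category_id):
    """Append frames and their boxes to a COCO-style dict in place.

    `frames` is a list of dicts with keys file_name, width, height and boxes,
    where each box is (x1, y1, x2, y2). This input layout is an assumption.
    """
    # Continue numbering after the largest ids already present.
    next_image_id = max((img["id"] for img in coco["images"]), default=0) + 1
    next_ann_id = max((ann["id"] for ann in coco["annotations"]), default=0) + 1
    for frame in frames:
        coco["images"].append({
            "id": next_image_id,
            "file_name": frame["file_name"],
            "width": frame["width"],
            "height": frame["height"],
        })
        for x1, y1, x2, y2 in frame["boxes"]:
            w, h = x2 - x1, y2 - y1
            coco["annotations"].append({
                "id": next_ann_id,
                "image_id": next_image_id,
                "category_id": category_id,
                "bbox": [x1, y1, w, h],  # COCO boxes are [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            next_ann_id += 1
        next_image_id += 1
    return coco
```

The dict can be loaded from and written back to the annotation JSON file with the standard `json` module.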
The results are summarized below:
Category | Model | Training Iterations | Training Hyperparams | Validation set AP | Test set AP |
---|---|---|---|---|---|
Dog | Baseline | 29000 | LR : 1e-3 for 10k, 1e-4 for 10k-20k, 1e-5 for 20k-29k | 26.9 | 25.3 |
Dog | Flickers as HN | 22000 | LR : 1e-4 for 15k, 1e-5 for 15k-22k | 28.1 | 26.4 |
Train | Baseline | 26000 | LR : 1e-3, stepsize : 10k, lr decay : 0.1 | 33.9 | 33.2 |
Train | Flickers as HN | 24000 | LR : 1e-3, stepsize : 10k, lr decay : 0.1 | 35.4 | 33.7 |
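To make the "stepsize" and "lr decay" entries concrete, the sketch below computes the learning rate at a given iteration, assuming these hyperparameters describe a Caffe-style `step` policy (the rate is multiplied by the decay factor every `stepsize` iterations). The function name `step_lr` is hypothetical and the defaults simply mirror the Train rows of the table.

```python
def step_lr(iteration, base_lr=1e-3, stepsize=10000, gamma=0.1):
    """Learning rate under a step policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


# Learning rate at a few points of a 26000-iteration run.
for it in (0, 9999, 10000, 20000, 25999):
    print(it, step_lr(it))
```

Under this assumption, the Dog baseline schedule (1e-3 for 10k, 1e-4 for 10k-20k, 1e-5 for 20k-29k) is the same step policy written out explicitly.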
A few examples of the reduction in false positives achieved for the 'Dog' category are shown below:
Baseline | Flickers as HN |
---|---|