This is a code base for a single-stage Feature Pyramid Network (FPN) with online hard example mining (OHEM). Unlike the paper, we also implement shared heads. Shared heads help to reduce memory consumption and improve performance a little.
The code is pure PyTorch 1.0, including the preprocessing of the input data. Annotations for both the COCO and VOC datasets are provided in the same format.
This repository contains a single-stage version of FPN as presented in the RetinaNet paper. The objective is to reproduce Table 1 with ResNet50 and OHEM.
ResNet is used as a backbone network (a) to build the pyramid features (b). Each classification (c) and regression (d) subnet is made of 4 convolutional layers followed by a final convolutional layer that predicts the class scores and bounding box coordinates, respectively.
We freeze the batch normalisation layers of ResNet based backbone networks.
We use a multi-box loss function with online hard example mining (OHEM), similar to SSD. A huge thanks to Max DeGroot and Ellis Brown for their PyTorch implementation of SSD and its loss function.
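As a rough illustration of the hard example mining idea, here is a minimal pure-Python sketch. It is not the loss code used in this repository: the function name `select_ohem_indices` and the 3:1 negative-to-positive ratio (the value commonly used in SSD) are assumptions made for the example.

```python
def select_ohem_indices(losses, is_positive, neg_pos_ratio=3):
    """Return sorted indices of anchors kept for the classification loss.

    losses: per-anchor classification loss values.
    is_positive: per-anchor booleans, True if the anchor matched a ground truth box.
    neg_pos_ratio: assumed cap on negatives kept per positive (hypothetical default).
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Keep at most neg_pos_ratio negatives for every positive anchor.
    num_neg = min(len(negatives), neg_pos_ratio * len(positives))
    # The "hard" negatives are the ones with the largest loss.
    hardest = sorted(negatives, key=lambda i: losses[i], reverse=True)[:num_neg]
    return sorted(positives + hardest)


losses = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.7, 0.2]
is_positive = [False, True, False, False, False, False, False, False]
kept = select_ohem_indices(losses, is_positive)
print(kept)
```

Only the kept anchors would contribute to the classification loss; all remaining easy negatives are ignored.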
We use two types of anchors.
Similar to RetinaNet, we build anchor boxes, which are sometimes called prior boxes in SSD.
As a baseline, we use three aspect ratios (AR) and three scale ratios (SR) per pyramid-level, which results in nine anchors per cell location in the grid of each pyramid level.
This results in a total of `67K` anchors/predictions.
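To make the "three aspect ratios times three scale ratios" layout concrete, the sketch below enumerates the anchor shapes for one grid cell. The specific values are assumptions borrowed from the RetinaNet paper (aspect ratios 1:2, 1:1, 2:1 and scale ratios 2^0, 2^(1/3), 2^(2/3)), and the base size of 32 is only illustrative; the values used in this repository may differ.

```python
import math


def anchor_shapes(base_size,
                  aspect_ratios=(0.5, 1.0, 2.0),
                  scale_ratios=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return (width, height) pairs for the anchors of a single grid cell.

    aspect_ratios are height / width; both ratio sets are assumed values.
    """
    shapes = []
    for scale in scale_ratios:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            # Keep the area fixed for a given scale and vary only the shape.
            width = math.sqrt(area / ratio)
            height = width * ratio
            shapes.append((width, height))
    return shapes


shapes = anchor_shapes(base_size=32)
print(len(shapes))  # 3 scale ratios x 3 aspect ratios
```

Passing `scale_ratios=(1.0,)` gives the three-anchor variant with a single scale ratio.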
Alternatively, we can use only one scale ratio. The reasoning is that the pyramid itself should be able to capture the scale space.
Now, we will have three anchors per cell location in the grid of each pyramid level.
The total number of anchors/predictions is then close to `22K`.
Although this saves computational cost by predicting fewer boxes, the recall of the resulting anchor boxes drops drastically.
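The anchor totals quoted above can be checked with a few lines of arithmetic. The sketch assumes the feature map sizes listed in the training settings of this README (`[75, 38, 19, 10, 5]` for a 600-pixel input); the helper name `total_anchors` is made up for the example.

```python
# Feature map sizes of the five pyramid levels for a 600-pixel input.
feature_map_sizes = [75, 38, 19, 10, 5]


def total_anchors(sizes, anchors_per_cell):
    """Count anchors over all square feature maps."""
    return sum(size * size for size in sizes) * anchors_per_cell


nine_per_cell = total_anchors(feature_map_sizes, 9)
three_per_cell = total_anchors(feature_map_sizes, 3)
print(nine_per_cell, three_per_cell)
```

With these sizes the counts come out as 67,995 and 22,665, which is close to the rounded 67K and 22K figures.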
An alternative is to boost the recall of the single-scale-ratio setup by performing k-means on the ground truth boxes with k = 3. We pick the three anchors from each pyramid level as initial cluster centres and then cluster the ground truth boxes. Intersection-over-union (IoU) is used as the distance metric. Since the cluster centres are centred around the origin, we need to move the centre of each ground truth box to the origin as well.
We performed clustering for COCO and VOC independently.
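The clustering step can be sketched as follows. This is a simplified illustration rather than the contents of `kmeans_for_anchors.py`: boxes are reduced to (width, height) pairs, which is equivalent to centring them at the origin, and the centre update uses a plain mean, which is an assumption. The toy boxes and initial centres are invented for the example.

```python
def iou_wh(box, centre):
    """IoU of two (width, height) boxes that are both centred at the origin."""
    inter = min(box[0], centre[0]) * min(box[1], centre[1])
    union = box[0] * box[1] + centre[0] * centre[1] - inter
    return inter / union


def kmeans_iou(boxes, init_centres, num_iters=20):
    """Cluster (width, height) boxes, assigning each box to its highest-IoU centre."""
    centres = list(init_centres)
    for _ in range(num_iters):
        clusters = [[] for _ in centres]
        for box in boxes:
            best = max(range(len(centres)), key=lambda k: iou_wh(box, centres[k]))
            clusters[best].append(box)
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                # Assumed update rule: mean width and mean height of the members.
                mean_w = sum(b[0] for b in members) / len(members)
                mean_h = sum(b[1] for b in members) / len(members)
                new_centres.append((mean_w, mean_h))
            else:
                new_centres.append(centre)  # keep an empty cluster's centre unchanged
        if new_centres == centres:
            break
        centres = new_centres
    return centres


gt_boxes = [(10, 20), (12, 22), (40, 40), (44, 36), (90, 30), (100, 34)]
initial = [(16, 32), (32, 32), (64, 32)]  # stand-ins for three pre-defined anchors
centres = kmeans_iou(gt_boxes, initial)
print(centres)
```

Using `1 - IoU` as the distance means that assigning a box to the nearest centre is the same as picking the centre with the highest IoU, which is what the `max` call does.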
Here are the recall and average IoU obtained before and after clustering the anchors.
Dataset | Type | SR | AR | #Anchors/level | Total | Average IoU | Recall % |
---|---|---|---|---|---|---|---|
VOC | Pre-defined | 3 | 3 | 9 | 67K | 0.78 | 96 |
VOC | Pre-defined | 2 | 3 | 6 | 44K | 0.76 | 95 |
VOC | Pre-defined | 1 | 3 | 3 | 22K | 0.66 | 88 |
VOC | Clustered | 1 | 3 | 3 | 22K | 0.74 | 97 |
COCO | Pre-defined | 3 | 3 | 9 | 67K | 0.72 | 85 |
COCO | Pre-defined | 2 | 3 | 6 | 44K | 0.69 | 85 |
COCO | Pre-defined | 1 | 3 | 3 | 22K | 0.61 | 77 |
COCO | Clustered | 1 | 3 | 3 | 22K | 0.65 | 87 |
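For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way to compute the two metrics. It is an assumption about the procedure, not the repository's evaluation code: it compares shapes only (boxes centred at the origin), takes the best-matching anchor per ground truth box, and counts a box as recalled when that IoU reaches a hypothetical threshold of 0.5.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that are both centred at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def anchor_quality(gt_boxes, anchors, iou_threshold=0.5):
    """Return (average best IoU, recall in percent) of anchors over ground truth boxes."""
    best_ious = [max(iou_wh(gt, anchor) for anchor in anchors) for gt in gt_boxes]
    average_iou = sum(best_ious) / len(best_ious)
    recalled = sum(1 for iou in best_ious if iou >= iou_threshold)
    return average_iou, 100.0 * recalled / len(best_ious)


gt_boxes = [(10, 20), (40, 40), (200, 50)]   # toy ground truth shapes
anchors = [(11, 21), (42, 38), (95, 32)]     # toy anchor shapes
average_iou, recall = anchor_quality(gt_boxes, anchors)
print(round(average_iou, 3), round(recall, 1))
```

In this toy case the very wide box has no good anchor, so it lowers both the average IoU and the recall.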
There is a variation of the standard network where the features of localisation and classification heads are shared.
Dataset | Backbone | Type | #Anchors | mAP@0.5 % | Download |
---|---|---|---|---|---|
VOC | ResNet50 | Pre-defined | 9 | 81.3 | link |
VOC | ResNet50 | Pre-defined | 3 | 81.3 | link |
VOC | ResNet50 | Clustered | 3 | 82.8 | link |
VOC | ResNet50 | Clustered-SH | 3 | 82.7 | link |
COCO | ResNet50 | Pre-defined | 9 | 46.1 | link |
COCO | ResNet50 | Clustered | 3 | 47.7 | link |
COCO | ResNet50 | Clustered-SH | 3 | 48.3 | link |
Here is the Google Drive link for all of the above models in a single folder.
The directory structure is similar to the one used in the training setup. You can evaluate these models using `evaluate.py` with the same hyperparameters used in training; please read the arguments carefully.
- Input image size is `600`.
- Resulting feature map sizes on the five pyramid levels are `[75, 38, 19, 10, 5]`.
- Batch size is set to `24`, with a learning rate of `0.0005`.
- VOC: the number of iterations is `50K`, and the learning rate is dropped after `40K` iterations.
- COCO: the number of iterations is `150K`, and the learning rate is dropped after `120K` iterations.
- VOC can be trained on 2 TitanX GPUs, 12GB each.
- COCO would need 3-4 GPUs because the number of classes is 80, hence the loss function requires more memory.
- `SH`, i.e. shared heads, helps to solve the memory problem up to a point, but we will still need 2 GPUs to train on VOC or COCO.
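The step schedule described in the list above can be written as a small function. This is only a sketch: the drop factor `gamma=0.1` is an assumption (the README does not state by how much the learning rate is reduced), and the function name is made up for the example.

```python
def learning_rate(iteration, base_lr=0.0005, step_values=(120000,), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once for every step value reached.

    gamma is an assumed drop factor; step_values defaults to the COCO setting.
    """
    drops = sum(1 for step in step_values if iteration >= step)
    return base_lr * gamma ** drops


coco_start = learning_rate(0)
coco_late = learning_rate(130000)
voc_late = learning_rate(45000, step_values=(40000,))
print(coco_start, coco_late, voc_late)
```

The `step_values` argument mirrors the `--step_values` flag of `train.py`, so the VOC setting corresponds to `step_values=(40000,)`.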
- We used Anaconda 3.7 as the Python distribution.
- You will need PyTorch 1.0.
- visdom and tensorboardX, if you want to visualise the loss and evaluation
  - if you want to use them, set the visdom/tensorboard flag to true while training
  - and configure the visdom port in the arguments in `train.py`.
- OpenCV is needed as well; install it using `conda install opencv`.
Please follow the dataset preparation README in the `prep` folder of this repo.
Once you have pre-processed the dataset, then you are ready to train your networks.
To train run the following command.
python train.py --dataset=coco --basenet=resnet50 --batch_size=24 --lr=0.0005 -j=4 --ngpu=2 --step_values=120000 --max_iter=150000 --visdom=True --tensorboard=True --val_step=15000 --anchor_type=kmeans --shared_heads=1
It will use all the visible GPUs. You can prepend `CUDA_VISIBLE_DEVICES=gpuids-comma-separated` to the above command to mask certain GPUs. We used a two-GPU machine to run these experiments.
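If you launch training from a Python script rather than a shell, the same GPU masking can be done through the process environment. The helper `build_env` below is hypothetical and not part of this repository; the command reuses a few flags from the training example above.

```python
import os


def build_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    return env


env = build_env([0, 1], base_env={})
command = ["python", "train.py", "--dataset=coco", "--basenet=resnet50", "--ngpu=2"]
print(env["CUDA_VISIBLE_DEVICES"], " ".join(command))
# To actually launch training, pass a full environment, for example:
# import subprocess; subprocess.run(command, env=build_env([0, 1]), check=True)
```

The variable has to be set before the training process starts, which is why it is passed through `env` instead of being changed after launch.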
Please check the arguments in `train.py` to adjust the training process to your liking.
The model is evaluated and saved after every `10K` iterations.
mAP@0.5 is computed after every 10K iterations and at the end.
The COCO evaluation protocol is demonstrated in `evaluate.py`:
python evaluate.py --dataset=coco --basenet=resnet50 --batch_size=24 --lr=0.0005 -j=2 --ngpu=2 --eval_iters=150000 --anchor_type=kmeans --shared_heads=1
Here are the COCO results obtained with the COCO API using the final model with shared heads and k-means based anchors. Results from `cocoapi` are slightly different from those in the table above. You can compare these results with Detectron from Facebook.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.285
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.492
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.293
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.138
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.318
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.391
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.258
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.412
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.436
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.251
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.479
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.575
To run the demo, you need to specify the path of the pretrained model (`model_path`).
Here is an example of running it.
python demo.py --model_path=<path to the pretrained model>
You can change the detection threshold and other parameters; please see the arguments of the file.
There are some demo samples in `demo_data\samples\`, which are used by the above script; results are saved in `demo_data\outputs\`.
Here is what some of the generated results look like:
A feature extraction facility is provided in `extract_features.py`.
You can run it in a similar way to `demo.py`. Specify the path to the pretrained model (`model_path`), `samples_path`, and `save_path`.
By default, the samples path and save path point to `demo_data`. It will compute features for the top `10` objects in all `.jpg` images from the `samples_path` directory.
You can change the number of nodes (top 10 at the moment) and other parameters; please see the arguments of the file.
Here is an example of running it.
python extract_features.py --model_path=<path to the pretrained model>
You can take inspiration from the data preparation scripts in the `prep` directory, which we used to pre-process the VOC and COCO datasets.
Also check out the README in the `prep` directory.
If you want to use clustered anchors, you can either use the existing anchors or cluster the anchors yourself using `kmeans_for_anchors.py`.