Hiring research interns for neural architecture search projects: [email protected]
Object Detection: DETR with iRPE
We equip DETR models with contextual product shared-head RPE, and report their mAP on the MSCOCO dataset.
- Absolute Position Encoding: Sinusoid
- Relative Position Encoding: iRPE (contextual product shared-head RPE; see the brief formula sketch below)
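For reference, here is a rough sketch of what "contextual RPE on keys" (the `-k` suffix in the configurations below) computes, following the formulation in the iRPE paper; the symbols $x_i$, $W^Q$, $W^K$, $r_{ij}$ and $d_z$ are the paper's notation, not identifiers from this repository. The relative encoding is added to the key before the dot product:

$$
e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + r_{ij})^{\top}}{\sqrt{d_z}}
$$

where $r_{ij}$ is the trainable encoding selected by the bucket of the relative position between positions $i$ and $j$; in the shared-head setting it is shared across attention heads.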
enc_rpe2d | Backbone | #Buckets | Epochs | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L | Link | Log |
---|---|---|---|---|---|---|---|---|---|---|---|
rpe-1.9-product-ctx-1-k | ResNet-50 | 7 x 7 | 150 | 0.409 | 0.614 | 0.429 | 0.195 | 0.443 | 0.605 | link | log, detail (188 MB) |
rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 150 | 0.410 | 0.615 | 0.434 | 0.192 | 0.445 | 0.608 | link | log, detail (188 MB) |
rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 300 | 0.422 | 0.623 | 0.446 | 0.205 | 0.457 | 0.613 | link | log, detail (375 MB) |
`--enc_rpe2d` is an argument that specifies the configuration of the relative position encoding.
- Install 3rd-party packages from requirements.txt:

  ```bash
  pip install -r ./requirements.txt
  ```
- [Optional, Recommended] Build the iRPE operators implemented in CUDA.
  Although iRPE can be implemented with PyTorch native functions, the backward pass of the PyTorch indexing function is very slow. We provide CUDA operators for more efficient training and recommend building them (a quick import check is sketched after this list). `nvcc` is necessary to build the CUDA operators.

  ```bash
  cd rpe_ops/
  python setup.py install --user
  ```
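To confirm the build succeeded, a check along the lines of the sketch below can be used. It is only an illustration: the module name `rpe_index_cpp` is an assumption about what `rpe_ops/setup.py` installs, so adjust it to whatever name the setup script actually registers.

```python
# check_rpe_ops.py -- a minimal sketch for verifying that the compiled CUDA
# operators are importable. NOTE: `rpe_index_cpp` is an assumed module name;
# check rpe_ops/setup.py for the actual extension name if the import fails.
def cuda_rpe_ops_available() -> bool:
    try:
        import rpe_index_cpp  # hypothetical name of the compiled extension
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    if cuda_rpe_ops_available():
        print("CUDA iRPE operators found: training can use the fast path.")
    else:
        print("CUDA iRPE operators not found: training will fall back to the "
              "slower PyTorch-native implementation.")
```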
You can download the MSCOCO dataset from https://cocodataset.org/#download. Please download the following files:

- train2017.zip (2017 train images)
- val2017.zip (2017 val images)
- annotations_trainval2017.zip (2017 train/val annotations)

After downloading them, move the three archives into the same directory, then decompress the annotations archive by `unzip ./annotations_trainval2017.zip`. We do NOT decompress the image archives.
The dataset should be saved as follows:
```
coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017.zip
└── val2017.zip
```
The zip files train2017.zip and val2017.zip can also be decompressed, in which case the dataset looks like this:
```
coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017
│   └── 000000000009.jpg
└── val2017
    └── 000000000009.jpg
```
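As a convenience, the hypothetical helper below checks that the expected files are in place before launching training; the default path `./coco_data` and the two accepted layouts simply mirror the trees above, and nothing here is part of the official codebase.

```python
# verify_coco_layout.py -- a hypothetical helper (not part of the codebase) that
# checks the dataset layout against the trees shown above.
from pathlib import Path

def verify_coco_layout(root: str = "./coco_data") -> None:
    root = Path(root)
    # The detection annotations must be decompressed.
    for name in ("instances_train2017.json", "instances_val2017.json"):
        path = root / "annotations" / name
        if not path.is_file():
            raise FileNotFoundError(f"Missing annotation file: {path}")
    # The image archives may be kept as zip files or decompressed into folders.
    for split in ("train2017", "val2017"):
        if not (root / f"{split}.zip").is_file() and not (root / split).is_dir():
            raise FileNotFoundError(
                f"Missing images for {split}: expected {split}.zip or {split}/ under {root}"
            )
    print(f"Dataset layout under {root} looks good.")

if __name__ == "__main__":
    verify_coco_layout()
```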
We add an extra argument `--enc_rpe2d rpe-{ratio}-{method}-{mode}-{shared_head}-{rpe_on}` for iRPE. It means that we add relative position encoding to all the encoder layers. The variables `ratio`, `method`, `mode`, `shared_head` and `rpe_on` are described below.
```
Parameters
----------
ratio: float
    The ratio to control the number of buckets.
    Example: 1.9, 2.0, 2.5, 3.0
    For the product method,
    ratio | The number of buckets
    ------|-----------------------
     1.9  |         7 x 7
     2.0  |         9 x 9
     2.5  |        11 x 11
     3.0  |        13 x 13
method: str
    The method name of image relative position encoding.
    Example: `euc` or `quant` or `cross` or `product`
    euc: Euclidean method
    quant: Quantization method
    cross: Cross method
    product: Product method
mode: str
    The mode of image relative position encoding.
    Example: `bias` or `ctx`
shared_head: bool
    Whether to share weight among different heads.
    Example: 0 or 1
    0: Do not share encoding weight among different heads.
    1: Share encoding weight among different heads.
rpe_on: str
    Where RPE attaches.
    "q": RPE on queries
    "k": RPE on keys
    "v": RPE on values
    "qk": RPE on queries and keys
    "qkv": RPE on queries, keys and values
```
If we want an image relative position encoding with contextual product shared-head 9 x 9 buckets, the argument is `--enc_rpe2d rpe-2.0-product-ctx-1-k`.
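To make the five fields concrete, here is a small, hypothetical parser for the argument string. It is not part of the repository's code, and the per-axis bucket formula is only inferred from the ratio table above rather than taken from the paper or the implementation.

```python
# parse_enc_rpe2d.py -- a hypothetical illustration of the --enc_rpe2d format;
# the real parsing lives in the repository's own code.
import math
from typing import NamedTuple

class RPEConfig(NamedTuple):
    ratio: float       # controls the number of buckets
    method: str        # euc, quant, cross, or product
    mode: str          # bias or ctx
    shared_head: bool  # share encoding weight among heads
    rpe_on: str        # q, k, v, qk, or qkv

def parse_enc_rpe2d(arg: str) -> RPEConfig:
    # Expected format: rpe-{ratio}-{method}-{mode}-{shared_head}-{rpe_on}
    prefix, ratio, method, mode, shared_head, rpe_on = arg.split("-")
    assert prefix == "rpe", "the argument should start with 'rpe-'"
    return RPEConfig(float(ratio), method, mode, bool(int(shared_head)), rpe_on)

def product_buckets_per_axis(ratio: float) -> int:
    # Inferred from the table above (1.9 -> 7, 2.0 -> 9, 2.5 -> 11, 3.0 -> 13);
    # treat this formula as an assumption, not a documented rule.
    return 2 * math.floor(2 * ratio) + 1

if __name__ == "__main__":
    cfg = parse_enc_rpe2d("rpe-2.0-product-ctx-1-k")
    n = product_buckets_per_axis(cfg.ratio)
    print(cfg)                   # RPEConfig(ratio=2.0, method='product', mode='ctx', ...)
    print(f"{n} x {n} buckets")  # 9 x 9 buckets
```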
- Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 150 epochs:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output
  ```
- Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 300 epochs:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 200 --epochs 300 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output
  ```
where `--nproc_per_node=8` means using 8 GPUs to train the model, `./coco_data` is the dataset folder, and `./output` is the model checkpoint folder.
The step is similar to training: add the flags `--eval --resume <the checkpoint path>` to the training command.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output --eval --resume rpe-2.0-product-ctx-1-k.pth
```
Our code is based on DETR. The implementation of `MultiheadAttention` is based on the PyTorch native operator (module and function). Thank you!
File | Description |
---|---|
models/rpe_attention/irpe.py | The implementation of image relative position encoding |
models/rpe_attention/multi_head_attention.py | The nn.Module MultiheadAttention with iRPE |
models/rpe_attention/rpe_attention_function.py | The function rpe_multi_head_attention_forward with iRPE |
rpe_ops | The CUDA implementation of iRPE operators for efficient training |
If this project is helpful for you, please cite it. Thank you! : )
```bibtex
@InProceedings{iRPE,
    title     = {Rethinking and Improving Relative Position Encoding for Vision Transformer},
    author    = {Wu, Kan and Peng, Houwen and Chen, Minghao and Fu, Jianlong and Chao, Hongyang},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {10033-10041}
}
```