Hiring research interns for neural architecture search projects: [email protected]
Object Detection: DETR with iRPE
We equip DETR models with contextual product shared-head RPE, and report their mAP on the MSCOCO dataset.
- Absolute Position Encoding: Sinusoid
- Relative Position Encoding: iRPE (contextual product shared-head RPE; see the brief formula sketch below)
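For reference, here is a rough sketch of what "contextual RPE on keys" (the `-k` suffix in the configurations below) computes, following the formulation in the iRPE paper; the symbols $x_i$, $W^Q$, $W^K$, $r_{ij}$ and $d_z$ are the paper's notation, not identifiers from this repository. The relative encoding is added to the key before the dot product:

$$
e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + r_{ij})^{\top}}{\sqrt{d_z}}
$$

where $r_{ij}$ is the trainable encoding selected by the bucket of the relative position between positions $i$ and $j$; in the shared-head setting it is shared across attention heads.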
enc_rpe2d | Backbone | #Buckets | Epochs | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L | Link | Log |
---|---|---|---|---|---|---|---|---|---|---|---|
rpe-1.9-product-ctx-1-k | ResNet-50 | 7 x 7 | 150 | 0.409 | 0.614 | 0.429 | 0.195 | 0.443 | 0.605 | link | log, detail (188 MB) |
rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 150 | 0.410 | 0.615 | 0.434 | 0.192 | 0.445 | 0.608 | link | log, detail (188 MB) |
rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 300 | 0.422 | 0.623 | 0.446 | 0.205 | 0.457 | 0.613 | link | log, detail (375 MB) |
`--enc_rpe2d` is an argument that specifies the configuration of the relative position encoding.
- Install 3rd-party packages from requirements.txt:

  ```bash
  pip install -r ./requirements.txt
  ```
- [Optional, Recommended] Build the iRPE operators implemented in CUDA.
  Although iRPE can be implemented with PyTorch native functions, the backward pass of the PyTorch indexing function is very slow. We provide CUDA operators for more efficient training and recommend building them (a quick import check is sketched after this list). `nvcc` is necessary to build the CUDA operators.

  ```bash
  cd rpe_ops/
  python setup.py install --user
  ```
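To confirm the build succeeded, a check along the lines of the sketch below can be used. It is only an illustration: the module name `rpe_index_cpp` is an assumption about what `rpe_ops/setup.py` installs, so adjust it to whatever name the setup script actually registers.

```python
# check_rpe_ops.py -- a minimal sketch for verifying that the compiled CUDA
# operators are importable. NOTE: `rpe_index_cpp` is an assumed module name;
# check rpe_ops/setup.py for the actual extension name if the import fails.
def cuda_rpe_ops_available() -> bool:
    try:
        import rpe_index_cpp  # hypothetical name of the compiled extension
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    if cuda_rpe_ops_available():
        print("CUDA iRPE operators found: training can use the fast path.")
    else:
        print("CUDA iRPE operators not found: training will fall back to the "
              "slower PyTorch-native implementation.")
```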
You can download the MSCOCO dataset from https://cocodataset.org/#download. Please download the following files:

- train2017.zip (2017 train images)
- val2017.zip (2017 val images)
- annotations_trainval2017.zip (2017 train/val annotations)

After downloading them, move the three archives into the same directory, then decompress the annotations archive by `unzip ./annotations_trainval2017.zip`. We do NOT decompress the image archives.
The dataset should be saved as follows:
```
coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017.zip
└── val2017.zip
```
The zip files train2017.zip and val2017.zip can also be decompressed, in which case the dataset looks like this:
```
coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017
│   └── 000000000009.jpg
└── val2017
    └── 000000000009.jpg
```
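As a convenience, the hypothetical helper below checks that the expected files are in place before launching training; the default path `./coco_data` and the two accepted layouts simply mirror the trees above, and nothing here is part of the official codebase.

```python
# verify_coco_layout.py -- a hypothetical helper (not part of the codebase) that
# checks the dataset layout against the trees shown above.
from pathlib import Path

def verify_coco_layout(root: str = "./coco_data") -> None:
    root = Path(root)
    # The detection annotations must be decompressed.
    for name in ("instances_train2017.json", "instances_val2017.json"):
        path = root / "annotations" / name
        if not path.is_file():
            raise FileNotFoundError(f"Missing annotation file: {path}")
    # The image archives may be kept as zip files or decompressed into folders.
    for split in ("train2017", "val2017"):
        if not (root / f"{split}.zip").is_file() and not (root / split).is_dir():
            raise FileNotFoundError(
                f"Missing images for {split}: expected {split}.zip or {split}/ under {root}"
            )
    print(f"Dataset layout under {root} looks good.")

if __name__ == "__main__":
    verify_coco_layout()
```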
We add an extra argument `--enc_rpe2d rpe-{ratio}-{method}-{mode}-{shared_head}-{rpe_on}` for iRPE. It means that we add relative position encoding to all the encoder layers. The variables `ratio`, `method`, `mode`, `shared_head` and `rpe_on` are described below.
```
Parameters
----------
ratio: float
    The ratio to control the number of buckets.
    Example: 1.9, 2.0, 2.5, 3.0
    For the product method,
    ratio | The number of buckets
    ------|-----------------------
     1.9  |         7 x 7
     2.0  |         9 x 9
     2.5  |        11 x 11
     3.0  |        13 x 13
method: str
    The method name of image relative position encoding.
    Example: `euc` or `quant` or `cross` or `product`
    euc: Euclidean method
    quant: Quantization method
    cross: Cross method
    product: Product method
mode: str
    The mode of image relative position encoding.
    Example: `bias` or `ctx`
shared_head: bool
    Whether to share weight among different heads.
    Example: 0 or 1
    0: Do not share encoding weight among different heads.
    1: Share encoding weight among different heads.
rpe_on: str
    Where RPE attaches.
    "q": RPE on queries
    "k": RPE on keys
    "v": RPE on values
    "qk": RPE on queries and keys
    "qkv": RPE on queries, keys and values
```
If we want an image relative position encoding with contextual product shared-head 9 x 9 buckets, the argument is `--enc_rpe2d rpe-2.0-product-ctx-1-k`.
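To make the five fields concrete, here is a small, hypothetical parser for the argument string. It is not part of the repository's code, and the per-axis bucket formula is only inferred from the ratio table above rather than taken from the paper or the implementation.

```python
# parse_enc_rpe2d.py -- a hypothetical illustration of the --enc_rpe2d format;
# the real parsing lives in the repository's own code.
import math
from typing import NamedTuple

class RPEConfig(NamedTuple):
    ratio: float       # controls the number of buckets
    method: str        # euc, quant, cross, or product
    mode: str          # bias or ctx
    shared_head: bool  # share encoding weight among heads
    rpe_on: str        # q, k, v, qk, or qkv

def parse_enc_rpe2d(arg: str) -> RPEConfig:
    # Expected format: rpe-{ratio}-{method}-{mode}-{shared_head}-{rpe_on}
    prefix, ratio, method, mode, shared_head, rpe_on = arg.split("-")
    assert prefix == "rpe", "the argument should start with 'rpe-'"
    return RPEConfig(float(ratio), method, mode, bool(int(shared_head)), rpe_on)

def product_buckets_per_axis(ratio: float) -> int:
    # Inferred from the table above (1.9 -> 7, 2.0 -> 9, 2.5 -> 11, 3.0 -> 13);
    # treat this formula as an assumption, not a documented rule.
    return 2 * math.floor(2 * ratio) + 1

if __name__ == "__main__":
    cfg = parse_enc_rpe2d("rpe-2.0-product-ctx-1-k")
    n = product_buckets_per_axis(cfg.ratio)
    print(cfg)                   # RPEConfig(ratio=2.0, method='product', mode='ctx', ...)
    print(f"{n} x {n} buckets")  # 9 x 9 buckets
```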
- Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 150 epochs:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output
  ```
- Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 300 epochs:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 200 --epochs 300 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output
  ```
where `--nproc_per_node=8` means using 8 GPUs to train the model, `./coco_data` is the dataset folder, and `./output` is the model checkpoint folder.
The step is similar to training: add the flags `--eval --resume <the checkpoint path>` to the training command.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output --eval --resume rpe-2.0-product-ctx-1-k.pth
```
Our code is based on DETR. The implementation of `MultiheadAttention` is based on the PyTorch native operator (module and function). Thank you!
File | Description |
---|---|
models/rpe_attention/irpe.py | The implementation of image relative position encoding |
models/rpe_attention/multi_head_attention.py | The nn.Module MultiheadAttention with iRPE |
models/rpe_attention/rpe_attention_function.py | The function rpe_multi_head_attention_forward with iRPE |
rpe_ops | The CUDA implementation of iRPE operators for efficient training |
If this project is helpful for you, please cite it. Thank you! : )
```bibtex
@InProceedings{iRPE,
    title     = {Rethinking and Improving Relative Position Encoding for Vision Transformer},
    author    = {Wu, Kan and Peng, Houwen and Chen, Minghao and Fu, Jianlong and Chao, Hongyang},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {10033-10041}
}
```