Hiring research interns for neural architecture search projects: [email protected]

Rethinking and Improving Relative Position Encoding for Vision Transformer

[Paper]

Object Detection: DETR with iRPE

Model Zoo

We equip DETR models with contextual product shared-head RPE, and report their mAP on the MSCOCO dataset.

  • Absolute Position Encoding: Sinusoid

  • Relative Position Encoding: iRPE (contextual product shared-head RPE)

| enc_rpe2d | Backbone | #Buckets | epoch | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L | Link | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rpe-1.9-product-ctx-1-k | ResNet-50 | 7 x 7 | 150 | 0.409 | 0.614 | 0.429 | 0.195 | 0.443 | 0.605 | link | log, detail (188 MB) |
| rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 150 | 0.410 | 0.615 | 0.434 | 0.192 | 0.445 | 0.608 | link | log, detail (188 MB) |
| rpe-2.0-product-ctx-1-k | ResNet-50 | 9 x 9 | 300 | 0.422 | 0.623 | 0.446 | 0.205 | 0.457 | 0.613 | link | log, detail (375 MB) |

--enc_rpe2d is the argument that specifies the configuration of the relative position encoding (see Argument for iRPE below for the format).

Usage

Setup

  1. Install third-party packages from requirements.txt:

pip install -r ./requirements.txt

  2. [Optional, Recommended] Build the iRPE operators implemented in CUDA.

Although iRPE can be implemented with native PyTorch functions, the backward pass of the PyTorch indexing function is very slow. We therefore provide CUDA operators for more efficient training and recommend building them. nvcc is required to build the CUDA operators.

cd rpe_ops/
python setup.py install --user
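
Before building, it can help to confirm that the prerequisites mentioned above are in place. Below is a minimal sketch (a hypothetical helper script, not part of the repository) that checks for nvcc and a CUDA-enabled PyTorch installation.

```python
# pre_build_check.py -- hypothetical helper, not part of the repo.
# Checks the prerequisites for building the rpe_ops CUDA extension.
import shutil
import torch

# nvcc must be on PATH to compile the CUDA extension.
nvcc = shutil.which("nvcc")
print("nvcc found:", nvcc if nvcc else "NOT FOUND")

# The extension targets CUDA, so PyTorch should see at least one GPU
# and should have been built with CUDA support.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA version:", torch.version.cuda)
```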

Data Preparation

You can download the MSCOCO dataset from https://cocodataset.org/#download.

Please download the following files:

  • 2017 Train images: train2017.zip
  • 2017 Val images: val2017.zip
  • 2017 Train/Val annotations: annotations_trainval2017.zip

After downloading them, move the three archives into the same directory, then decompress the annotations archive with unzip ./annotations_trainval2017.zip. The image archives do not need to be decompressed.

The dataset should be organized as follows:

coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017.zip
└── val2017.zip

The zip files train2017.zip and val2017.zip can also be decompressed, in which case the layout is:

coco_data
├── annotations
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
├── train2017
│   └── 000000000009.jpg
└── val2017
    └── 000000000009.jpg
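
A quick way to verify the layout is a small check script. This is a minimal sketch (not part of the repository); the file names are taken from the tree above, and both the zipped and the extracted image layouts are accepted.

```python
# check_coco_layout.py -- hypothetical helper, not part of the repo.
from pathlib import Path

root = Path("./coco_data")
annotations = [
    "captions_train2017.json", "captions_val2017.json",
    "instances_train2017.json", "instances_val2017.json",
    "person_keypoints_train2017.json", "person_keypoints_val2017.json",
]

# Annotation files must be extracted from annotations_trainval2017.zip.
for name in annotations:
    path = root / "annotations" / name
    print(f"{path}: {'ok' if path.exists() else 'MISSING'}")

# Images may stay zipped (train2017.zip / val2017.zip) or be extracted
# into train2017/ and val2017/ directories.
for split in ("train2017", "val2017"):
    zipped = (root / f"{split}.zip").exists()
    extracted = (root / split).is_dir()
    print(f"{split}: {'ok' if (zipped or extracted) else 'MISSING'}")
```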

Argument for iRPE

We add an extra argument --enc_rpe2d rpe-{ratio}-{method}-{mode}-{shared_head}-{rpe_on} for iRPE, which adds relative position encoding to all the encoder layers.

Here is the format of the variables ratio, method, mode, shared_head and rpe_on.

Parameters
----------
ratio: float
    The ratio to control the number of buckets.
    Example: 1.9, 2.0, 2.5, 3.0
    For the product method,

    ratio | The number of buckets
    ------|-----------------------
    1.9   | 7 x 7
    2.0   | 9 x 9
    2.5   | 11 x 11
    3.0   | 13 x 13

method: str
    The method name of image relative position encoding.
    Example: `euc` or `quant` or `cross` or `product`
    euc: Euclidean method
    quant: Quantization method
    cross: Cross method
    product: Product method
mode: str
    The mode of image relative position encoding.
    Example: `bias` or `ctx`
shared_head: bool
    Whether to share weight among different heads.
    Example: 0 or 1
    0: Do not share encoding weight among different heads.
    1: Share encoding weight among different heads.
rpe_on: str
    Where RPE attaches.
    "q": RPE on queries
    "k": RPE on keys
    "v": RPE on values
    "qk": RPE on queries and keys
    "qkv": RPE on queries, keys and values

If we want an image relative position encoding with contextual product shared-head 9 x 9 buckets, the argument is --enc_rpe2d rpe-2.0-product-ctx-1-k.
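
For illustration, here is a minimal sketch (a hypothetical helper, not part of the repository) that composes the --enc_rpe2d string from the fields documented above; the bucket mapping repeats the product-method table.

```python
# Hypothetical helper: compose the --enc_rpe2d string from its documented fields.

# ratio -> bucket grid for the product method, as listed in the table above.
PRODUCT_BUCKETS = {1.9: "7 x 7", 2.0: "9 x 9", 2.5: "11 x 11", 3.0: "13 x 13"}

def make_enc_rpe2d(ratio: float, method: str, mode: str,
                   shared_head: bool, rpe_on: str) -> str:
    assert method in {"euc", "quant", "cross", "product"}
    assert mode in {"bias", "ctx"}
    assert rpe_on in {"q", "k", "v", "qk", "qkv"}
    return f"rpe-{ratio}-{method}-{mode}-{int(shared_head)}-{rpe_on}"

# Contextual product shared-head RPE on keys with 9 x 9 buckets:
print(make_enc_rpe2d(2.0, "product", "ctx", True, "k"))  # rpe-2.0-product-ctx-1-k
print(PRODUCT_BUCKETS[2.0])                              # 9 x 9
```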

Training

  • Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 150 epochs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output
  • Train a DETR-ResNet50 with iRPE (contextual product shared-head 9 x 9 buckets) for 300 epochs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 200 --epochs 300 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output

where --nproc_per_node=8 means using 8 GPUs to train the model, ./coco_data is the dataset folder, and ./output is the model checkpoint folder.

Evaluation

The command is similar to training: add the flags --eval --resume <the checkpoint path>.

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --lr_drop 100 --epochs 150 --coco_path ./coco_data --enc_rpe2d rpe-2.0-product-ctx-1-k --output_dir ./output --eval --resume rpe-2.0-product-ctx-1-k.pth
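
If you want to inspect a downloaded checkpoint before passing it to --resume, the sketch below may help. It is not part of the repository and assumes a DETR-style checkpoint dict with a "model" state dict and optionally an "epoch" entry; adjust the keys if the actual file differs.

```python
# inspect_checkpoint.py -- hypothetical helper, not part of the repo.
import torch

ckpt = torch.load("rpe-2.0-product-ctx-1-k.pth", map_location="cpu")
print("top-level keys:", list(ckpt.keys()))

# Fall back to treating the file as a bare state dict if there is no "model" key.
state_dict = ckpt.get("model", ckpt)
print("number of parameter tensors:", len(state_dict))
if "epoch" in ckpt:
    print("saved at epoch:", ckpt["epoch"])
```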

Code Structure

Our code is based on DETR. The implementation of MultiheadAttention is based on the PyTorch native operators (module, function). Thank you!

| File | Description |
|---|---|
| models/rpe_attention/irpe.py | The implementation of image relative position encoding |
| models/rpe_attention/multi_head_attention.py | The nn.Module MultiheadAttention with iRPE |
| models/rpe_attention/rpe_attention_function.py | The function rpe_multi_head_attention_forward with iRPE |
| rpe_ops | The CUDA implementation of iRPE operators for efficient training |

Citing iRPE

If this project is helpful to you, please cite it. Thank you! :)

@InProceedings{iRPE,
    title     = {Rethinking and Improving Relative Position Encoding for Vision Transformer},
    author    = {Wu, Kan and Peng, Houwen and Chen, Minghao and Fu, Jianlong and Chao, Hongyang},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {10033-10041}
}

License

Apache License