PyTorch implementation of our ViT-ZSL model for zero-shot learning:
Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning
Faisal Alamri, Anjan Dutta
IMVIP, 2021
Zero-Shot Learning (ZSL) aims to recognise unseen object classes, which are not observed during the training phase. The existing body of work on ZSL mostly relies on pretrained visual features and lacks an explicit attribute localisation mechanism on images. In this work, we propose an attention-based model in the problem setting of ZSL to learn attributes useful for unseen class recognition. Our method uses an attention mechanism adapted from Vision Transformer to capture and learn discriminative attributes by splitting images into small patches. We conduct experiments on three popular ZSL benchmarks (i.e., AWA2, CUB and SUN) and set new state-of-the-art harmonic mean results on all three datasets, which illustrates the effectiveness of our proposed method.
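The harmonic mean mentioned above can be sketched as follows. This is a minimal illustration that assumes the usual generalised ZSL convention, in which the harmonic mean combines accuracy on seen classes and accuracy on unseen classes. The function name `harmonic_mean` and the input values are hypothetical and are not taken from the notebook or the paper.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Both inputs must use the same scale (both fractions or both percentages).
    Returns 0.0 when both accuracies are zero to avoid dividing by zero.
    """
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / total


# Illustrative numbers only; these are NOT results from the paper.
print(harmonic_mean(0.8, 0.4))
```

Because the harmonic mean is dominated by the smaller of the two values, a model that performs well on seen classes but poorly on unseen classes still receives a low score.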
To prepare the datasets, follow the instructions provided in data/Dataset_Instruction.txt
Create and activate a Conda environment; refer to the Conda environment documentation for more information.
# conda create -n {ENVNAME} python=3.6
conda create -n ViT_ZSL python=3.6
# Activate the environment: conda activate {ENVNAME}
conda activate ViT_ZSL
This is a PyTorch implementation. Install the required packages:
pip install -r requirements.txt
# PyTorch
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
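After installation, a quick check along these lines can confirm that the environment is usable. This is a sketch rather than part of the repository: the helper name `check_environment` is hypothetical, and it only assumes that the packages installed above are importable as `torch` and `torchvision`.

```python
import importlib.util


def check_environment(packages=("torch", "torchvision")):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    status = check_environment()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")

    # Only import torch when it is actually installed.
    if status.get("torch"):
        import torch

        print("PyTorch version:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
```

If CUDA is reported as unavailable, the installed `cudatoolkit` version may not match the GPU driver on the machine.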
To run the code, open ViT_ZSL.ipynb:
jupyter notebook ViT_ZSL.ipynb
For details, please read our paper. If you require any further information, feel free to contact us by email.
If you use ViT-ZSL in your research, please use the following BibTeX entry.
@InProceedings{Alamri2021ViTZSL,
author = {Faisal Alamri and Anjan Dutta},
title = {Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning},
booktitle = {IMVIP},
year = {2021}
}