GLAD: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap [WACV 2024]
In this work, we tackle the challenging problem of unsupervised video domain adaptation (UVDA) for action recognition. We specifically focus on scenarios with a substantial domain gap, in contrast to existing works that primarily deal with small domain gaps between labeled source domains and unlabeled target domains.
The contributions of this work are two-fold.
To establish a more realistic setting, we introduce a novel UVDA scenario, denoted as Kinetics→BABEL, with a more considerable domain gap in terms of both temporal dynamics and background shifts.
- To tackle the temporal shift, i.e., action duration difference between the source and target domains, we propose a global-local view alignment approach.
- To mitigate the background shift, we propose to learn temporal-order-sensitive representations through temporal order learning and background-invariant representations through background augmentation.
We empirically validate that the proposed method significantly improves over existing methods on the Kinetics→BABEL dataset with a large domain gap.
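To make the global-local idea concrete, here is a minimal Python sketch of how frame indices for a global view (spread over the whole video) and a local view (confined to a short window) might be sampled. The function names, the evenly spaced sampling, and the `window` parameter are illustrative assumptions, not the sampling code used in this repository.

```python
import random


def sample_global_view(num_frames: int, clip_len: int) -> list[int]:
    """Spread clip_len indices evenly over the whole video (hypothetical helper)."""
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    # Take the centre of each of clip_len equal-length segments.
    return [min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
            for i in range(clip_len)]


def sample_local_view(num_frames: int, clip_len: int, window: int,
                      rng: random.Random) -> list[int]:
    """Pick clip_len indices inside one randomly placed short window (hypothetical helper)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    window = max(1, min(window, num_frames))
    start = rng.randint(0, num_frames - window)  # inclusive bounds
    return [start + i for i in sample_global_view(window, clip_len)]
```

For a 100-frame video, `sample_global_view(100, 4)` covers the full duration, while `sample_local_view(100, 4, 16, random.Random(0))` stays inside a 16-frame window, so the two views see the same action at different temporal scales.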
We provide our working conda environment as an exported yaml file.
conda env create --file requirements/environment.yml
pip install -e .
The AMASS dataset is a comprehensive motion capture skeleton dataset that serves as an input for the BABEL dataset. Unlike the original BABEL, our proposed dataset, Kinetics→BABEL, uses rendered videos rather than skeletons as input. To access them, please create an account on AMASS and download the BMLrub rendered videos.
Make symlinks to the actual dataset paths.
mkdir data
ln -s /KINETICS/PATH/ ./data/k400
ln -s /BABEL/PATH/ ./data/babel
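As a quick sanity check after linking, the following Python sketch reports which dataset entries under `./data` do not resolve to a directory. The helper name `missing_dataset_links` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_dataset_links(data_root: str, names=("k400", "babel")) -> list[str]:
    """Return the dataset names whose entry under data_root is not a usable directory."""
    root = Path(data_root)
    # Path.is_dir() follows symlinks, so a dangling link is reported as missing.
    return [name for name in names if not (root / name).is_dir()]
```

Calling `missing_dataset_links("data")` from the repository root should return an empty list once both links point at real directories.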
We highly recommend extracting raw frames beforehand to optimize I/O. Below are example structures for each dataset.
Kinetics Structure
./data/k400/rawframes_resized
├── train
│ ├── applauding
│ │ ├── 0nd-Gc3HkmU_000019_000029
│ │ │ ├── img_00000.jpg
│ │ │ ├── img_00001.jpg
│ │ │ ├── img_00002.jpg
│ │ │ └── ...
│ │ ├── 0Tq8uFakTbk_000000_000010
│ │ ├── 0XrsfW9ejfk_000000_000010
│ │ ├── 0YQrMye3BBY_000000_000010
│ │ ├── 1WMulo84kBY_000020_000030
│ │ └── ...
│ ├── balloon_blowing
│ ├── ...
│ ├── unboxing
│ └── waxing_legs
└── val
  ├── applauding
  ├── balloon_blowing
  ├── ...
  ├── unboxing
  └── waxing_legs
BABEL Structure
./data/babel
├── train
│ ├── 000000
│ │ ├── img_00001.jpg
│ │ ├── img_00002.jpg
│ │ └── ...
│ ├── 000002
│ └── ...
└── val
  ├── ...
  ├── 013286
  └── 013288
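The layouts above can be checked programmatically before training. The sketch below walks a Kinetics-style split (class / video / frames) and counts the frames per video. The function names are hypothetical, and the format of the repository's file lists is not shown here, so this only returns plain tuples.

```python
from pathlib import Path


def count_frames(video_dir: Path, pattern: str = "img_*.jpg") -> int:
    """Count raw-frame files in one video directory."""
    return sum(1 for _ in video_dir.glob(pattern))


def index_kinetics_split(split_dir: Path) -> list[tuple[str, str, int]]:
    """List (class_name, video_id, num_frames) for a split laid out as class/video/frames."""
    rows = []
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for video_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            rows.append((class_dir.name, video_dir.name, count_frames(video_dir)))
    return rows
```

For BABEL, where video directories sit directly under `train` and `val`, `count_frames` can be applied to each subdirectory without the class level. Videos with zero frames usually indicate a failed extraction.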
Extract median frames from the Kinetics training raw frames with the following command.
python utils/extract_median_by_rawframes.py \
--ann-file 'data/filelists/k400/filelist_k400_train_closed.txt' \
--outdir 'data/median/k400' \
--start-index 0 \
--data-prefix 'data/k400/rawframes_resized'
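The internals of `utils/extract_median_by_rawframes.py` are not shown here. As a rough illustration only, the sketch below assumes that a "median" image is the per-pixel temporal median of a video's frames, and that background augmentation blends such an image into another frame with a weight `lam`. Both function names, the blending formula, and the grayscale nested-list representation are assumptions; a practical version would operate on RGB arrays.

```python
from statistics import median

Frame = list[list[float]]  # one grayscale frame as rows of pixel intensities


def temporal_median(frames: list[Frame]) -> Frame:
    """Per-pixel median across time; briefly visible moving content tends to be suppressed."""
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(width)]
            for y in range(height)]


def mix_background(frame: Frame, background: Frame, lam: float) -> Frame:
    """Blend a frame with a background image: (1 - lam) * frame + lam * background."""
    return [[(1.0 - lam) * p + lam * b for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]
```

In the usage below, the bright value 200 appears in only one of three frames, so the median discards it.

```python
frames = [[[0, 10]], [[0, 200]], [[0, 12]]]
background = temporal_median(frames)          # [[0, 12]]
mixed = mix_background([[10.0, 20.0]], [[0.0, 0.0]], 0.5)  # [[5.0, 10.0]]
```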
The training process has two stages.
- Pretrain TOL (Temporal Order Learning)
source tools/dist_train.sh configs/tol.py 8 \
--seed 0
The training results will be generated under work_dirs/tol/ and are used in the next stage.
- GLAD
source tools/dist_train.sh configs/glad.py 8 \
--seed 3 \
--validate --test-last --test-best
To evaluate the best GLAD checkpoint:
source tools/dist_test.sh configs/glad.py $(find 'work_dirs/glad' -name '*best*.pth' | head -1) 8 \
--eval 'mean_class_accuracy' 'confusion_matrix'
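For readers unfamiliar with the `mean_class_accuracy` metric, the sketch below shows one common definition: the average of per-class recalls computed from a confusion matrix whose rows are true classes and whose columns are predictions. It is an illustration, not the evaluation code invoked by the command above, and skipping classes without samples is an assumption.

```python
def mean_class_accuracy(confusion: list[list[int]]) -> float:
    """Average per-class recall; rows are true classes, columns are predictions."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # assumption: ignore classes with no ground-truth samples
        recalls.append(row[i] / total)
    if not recalls:
        raise ValueError("confusion matrix has no samples")
    return sum(recalls) / len(recalls)
```

With the matrix `[[8, 2], [1, 1]]`, the per-class recalls are 0.8 and 0.5, giving a mean class accuracy of 0.65, whereas plain accuracy would be 9/12 = 0.75. The gap shows why this metric is preferred when classes are imbalanced.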
This project has been made possible through the generous funding and support of NCSOFT Corporation. We extend our sincere gratitude for their contribution and belief in our work.
This project is released under the BSD-3-Clause license.
@inproceedings{leebae2024glad,
title={{GLAD}: Global-Local View Alignment and Background Debiasing for Unsupervised Video Domain Adaptation with Large Domain Gap},
author={Lee, Hyogun and Bae, Kyungho and Ha, Seong Jong and Ko, Yumin and Park, Gyeong-Moon and Choi, Jinwoo},
booktitle={Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year={2024}
}