- From "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. of the IEEE Int. Symposium on Multimedia (ISM), Dec. 2021.
- Written by Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris and Ioannis Patras.
- This software can be used for training a deep learning architecture which estimates frames' importance after modeling their dependencies with the help of global and local multi-head attention mechanisms that integrate a positional encoding component. Training is performed in a supervised manner based on ground-truth data (human-generated video summaries). After being trained on a collection of videos, the PGL-SUM model is capable of producing representative summaries for unseen videos, according to a user-specified time budget for the summary duration.
Developed, checked and verified on an Ubuntu 20.04.3 PC with an NVIDIA TITAN Xp GPU. Main packages required:
Python | PyTorch | CUDA Version | cuDNN Version | TensorBoard | TensorFlow | NumPy | H5py |
---|---|---|---|---|---|---|---|
3.8(.8) | 1.7.1 | 11.0 | 8005 | 2.4.1 | 2.3.0 | 1.20.2 | 2.10.0 |
Structured h5 files with the video features and annotations of the SumMe and TVSum datasets are available within the data folder. The GoogleNet features of the video frames were extracted by Ke Zhang and Wei-Lun Chao and the h5 files were obtained from Kaiyang Zhou. These files have the following structure:
    /key
        /features                 2D-array with shape (n_steps, feature-dimension)
        /gtscore                  1D-array with shape (n_steps), stores ground truth importance score (used for training, e.g. regression loss)
        /user_summary             2D-array with shape (num_users, n_frames), each row is a binary vector (used for test)
        /change_points            2D-array with shape (num_segments, 2), each row stores indices of a segment
        /n_frame_per_seg          1D-array with shape (num_segments), indicates number of frames in each segment
        /n_frames                 number of frames in original video
        /picks                    positions of subsampled frames in original video
        /n_steps                  number of subsampled frames
        /gtsummary                1D-array with shape (n_steps), ground truth summary provided by user (used for training, e.g. maximum likelihood)
        /video_name (optional)    original video name, only available for SumMe dataset
Original videos and annotations for each dataset are also available in the dataset providers' webpages:
To train the model using one of the aforementioned datasets and for a number of randomly created splits of the dataset (where in each split 80% of the data is used for training and 20% for testing) use the corresponding JSON file that is included in the data/splits directory. This file contains the 5 randomly-generated splits that were utilized in our experiments.
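As a rough illustration of how such 80/20 splits could be generated, the sketch below draws 5 random partitions of a list of video keys and prints the first one as JSON. The helper name `make_splits`, the `video_N` key format and the `train_keys` / `test_keys` field names are assumptions made for this example and may not match the layout of the files in data/splits, which remain the reference for reproducing the reported experiments.

```python
import json
import random


def make_splits(video_keys, n_splits=5, train_ratio=0.8, seed=12345):
    """Return a list of random train/test partitions of `video_keys`.

    Hypothetical helper: the field names below are assumptions, not
    necessarily those used by the JSON files in data/splits.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    n_train = round(len(video_keys) * train_ratio)
    splits = []
    for _ in range(n_splits):
        shuffled = list(video_keys)
        rng.shuffle(shuffled)
        splits.append({
            "train_keys": sorted(shuffled[:n_train]),
            "test_keys": sorted(shuffled[n_train:]),
        })
    return splits


# 25 illustrative keys (SumMe contains 25 videos); the key format is assumed.
keys = [f"video_{i}" for i in range(1, 26)]
splits = make_splits(keys)
print(json.dumps(splits[0], indent=2))
```

Each split is drawn independently, so a video may appear in the test set of more than one split.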
For training the model using a single split, run:
python model/main.py --split_index N --n_epochs E --batch_size B --video_type 'dataset_name'
where N refers to the index of the used data split, E refers to the number of training epochs, B refers to the batch size, and dataset_name refers to the name of the used dataset.
Alternatively, to train the model for all 5 splits, use the run_summe_splits.sh and/or run_tvsum_splits.sh script and do the following:
chmod +x model/run_summe_splits.sh # Makes the script executable.
chmod +x model/run_tvsum_splits.sh # Makes the script executable.
./model/run_summe_splits.sh # Runs the script.
./model/run_tvsum_splits.sh # Runs the script.
Please note that after each training epoch the algorithm performs an evaluation step, using the trained model to compute the importance scores for the frames of each video of the test set. These scores are then used by the provided evaluation scripts to assess the overall performance of the model (in F-Score).
The progress of the training can be monitored via the TensorBoard platform by:
- opening a command line (cmd) and running:
tensorboard --logdir=/path/to/log-directory --host=localhost
- opening a browser and pasting the returned URL from cmd.
Setup for the training process:
- In data_loader.py, specify the path to the h5 file of the used dataset, and the path to the JSON file containing data about the utilized data splits.
- In configs.py, define the directory where the analysis results will be saved.
Arguments in configs.py:
Parameter name | Description | Default Value | Options |
---|---|---|---|
--mode | Mode for the configuration. | 'train' | 'train', 'test' |
--verbose | Print or not training messages. | 'false' | 'true', 'false' |
--video_type | Used dataset for training the model. | 'SumMe' | 'SumMe', 'TVSum' |
--input_size | Size of the input feature vectors. | 1024 | int > 0 |
--seed | Chosen number for generating reproducible random numbers. | 12345 | None, int |
--fusion | Type of the used approach for feature fusion. | 'add' | None, 'add', 'mult', 'avg', 'max' |
--n_segments | Number of video segments; equal to the number of local attention mechanisms. | 4 | None, int ≥ 2 |
--pos_enc | Type of the applied positional encoding. | 'absolute' | None, 'absolute', 'relative' |
--heads | Number of heads of the global attention mechanism. | 8 | int > 0 |
--n_epochs | Number of training epochs. | 200 | int > 0 |
--batch_size | Size of the training batch, 20 for 'SumMe' and 40 for 'TVSum'. | 20 | 0 < int ≤ len(Dataset) |
--clip | Gradient norm clipping parameter. | 5 | float |
--lr | Value of the adopted learning rate. | 5e-5 | float |
--l2_req | Value of the regularization factor. | 1e-5 | float |
--split_index | Index of the utilized data split. | 0 | 0 ≤ int ≤ 4 |
--init_type | Weight initialization method. | 'xavier' | None, 'xavier', 'normal', 'kaiming', 'orthogonal' |
--init_gain | Scaling factor for the initialization methods. | None | None, float |
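For orientation, a subset of the parameters listed above could be exposed through `argparse` roughly as follows. This is only a sketch and not the contents of configs.py: the helper names `str2bool` and `build_parser` are hypothetical, and the way the actual script parses and validates each option may differ.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false' style strings (hypothetical helper)."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser for an illustrative subset of the documented options."""
    parser = argparse.ArgumentParser(description="Illustrative subset of the training options")
    parser.add_argument("--mode", default="train", choices=["train", "test"])
    parser.add_argument("--verbose", type=str2bool, default=False)
    parser.add_argument("--video_type", default="SumMe", choices=["SumMe", "TVSum"])
    parser.add_argument("--n_segments", type=int, default=4)
    parser.add_argument("--heads", type=int, default=8)
    parser.add_argument("--n_epochs", type=int, default=200)
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--split_index", type=int, default=0, choices=range(5))
    return parser


# Options for a TVSum run on the third split, using the documented batch size.
args = build_parser().parse_args(
    ["--video_type", "TVSum", "--batch_size", "40", "--split_index", "2"]
)
print(args.video_type, args.batch_size, args.split_index)
```

Options that are not passed on the command line keep the default values from the table.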
The utilized model selection criterion relies on the post-processing of the calculated losses over the training epochs and enables the selection of a well-trained model by indicating the training epoch. To evaluate the trained models of the architecture and automatically select a well-trained model, define the dataset_path in compute_fscores.py and run evaluate_exp.sh. To run this file, specify:
- base_path/exp$exp_num: the path to the folder where the analysis results are stored,
- $dataset: the dataset being used, and
- $eval_method: the used approach for computing the overall F-Score after comparing the generated summary with all the available user summaries (i.e., 'max' for SumMe and 'avg' for TVSum).
sh evaluation/evaluate_exp.sh $exp_num $dataset $eval_method
For further details about the adopted structure of directories in our implementation, please check line #6 and line #11 of evaluate_exp.sh.
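To make the role of the 'max' and 'avg' options more concrete, the sketch below compares a binary frame-selection vector with several user summaries and aggregates the per-user F-Scores in either way. It is a simplified illustration under the assumption that all summaries are equal-length 0/1 vectors; the function names `f_score` and `evaluate_summary` are hypothetical and this is not the code of compute_fscores.py.

```python
def f_score(pred, ref):
    """F-Score between two equal-length binary frame-selection vectors."""
    overlap = sum(p and r for p, r in zip(pred, ref))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_summary(pred, user_summaries, eval_method):
    """Aggregate the per-user F-Scores with 'max' or 'avg'."""
    scores = [f_score(pred, user) for user in user_summaries]
    if eval_method == "max":
        return max(scores)
    if eval_method == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown eval_method: {eval_method!r}")


# Toy data: one generated summary and three user summaries over 6 frames.
predicted = [1, 1, 0, 0, 1, 0]
users = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
]
print(evaluate_summary(predicted, users, "max"))
print(evaluate_summary(predicted, users, "avg"))
```

With 'max' the generated summary is credited with its best agreement among the users, whereas 'avg' rewards agreement with all of them.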
We have released the trained models for our main experiments, namely Table III and Table IV of our ISM 2021 paper. The inference.py script lets you evaluate the reported trained models for our 5 randomly-created data splits. First, download the trained models with the following script:
sudo apt-get install unzip wget
wget "https://zenodo.org/record/5635735/files/pretrained_models.zip?download=1" -O pretrained_models.zip
unzip pretrained_models.zip -d inference
rm -f pretrained_models.zip
Then, specify the PATHs for the model, the split_file and the dataset in use. Finally, run the script with the following syntax:
python inference/inference.py --table ID --dataset 'dataset_name'
where ID refers to the id of the reported table, and dataset_name refers to the name of the used dataset.
Given the above pre-trained models, we present some additional evaluation results following the rank-based evaluation protocol proposed here, and the diversity measure described here. Moreover, we provide results with regard to the number of trainable parameters and the required training time. The code for implementing the rank-based evaluation protocol is available at CA-SUM, and the summary diversity was measured using the relevant code from DSNet.
Dataset | Spearman's ρ | Kendall's τ | Summary Diversity | Params (M) | Train time (sec / epoch) |
---|---|---|---|---|---|
SumMe | - | - | 0.631 | 9.44 | 0.63 |
TVSum | 0.206 | 0.157 | 0.488 | 9.44 | 1.17 |
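As background for the rank-correlation columns, the sketch below computes Kendall's τ between predicted and human frame-level importance scores. It is a simplified O(n²) tau-a variant without tie correction, written only for illustration; the function name `kendall_tau_a` and the toy scores are assumptions, and it is not the evaluation code referenced above. In practice a library routine such as `scipy.stats.kendalltau`, which handles ties, is preferable.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Simplified O(n^2) version with no tie correction.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1  # both sequences order the pair the same way
        elif product < 0:
            discordant += 1  # the sequences disagree on the pair's order
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy frame-level scores (illustrative values only).
predicted_scores = [0.1, 0.4, 0.35, 0.8]
human_scores = [0.2, 0.5, 0.6, 0.9]
print(kendall_tau_a(predicted_scores, human_scores))
```

A value of 1 means both score sequences rank the frames identically, and -1 means the rankings are exactly reversed.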
If you find our work, code or pretrained models useful in your work, please cite the following publication:
E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. IEEE Int. Symposium on Multimedia (ISM), Dec. 2021.
BibTeX:
@INPROCEEDINGS{9666088,
author = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
title = {Combining Global and Local Attention with Positional Encoding for Video Summarization},
booktitle = {2021 IEEE International Symposium on Multimedia (ISM)},
month = {December},
year = {2021},
pages = {226-234}
}
Copyright (c) 2021, Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, Ioannis Patras / CERTH-ITI. All rights reserved. This code is provided for academic, non-commercial use only. Redistribution and use in source and binary forms, with or without modification, are permitted for academic non-commercial use provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation provided with the distribution.
This software is provided by the authors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
A re-implementation of PGL-SUM also appears in ModelScope. Please note that we have not tested this re-implementation and therefore cannot confirm if it fully reproduces our original implementation.