This project focuses on fine-grained sports action recognition using two main architectures:
- CNN-based sequence models: CNNs extract per-frame features, and RNN (GRU) layers model the temporal sequence. The CNN backbones are:
  - VGG19
  - InceptionV3
  - InceptionV4-ResNet (hybrid model)
  - EfficientNetB4
- ViViT (Video Vision Transformer): A pure transformer-based approach for end-to-end video classification that captures both spatial and temporal features.
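To make the first architecture concrete, here is a minimal Keras sketch of a CNN + GRU classifier. It is an illustration under assumptions, not the project's actual code: the function name `build_cnn_gru`, the frame count, frame size, GRU width, and the default of 101 classes (the size of the full UCF101 label set) are all hypothetical choices, and VGG19 stands in for any of the listed backbones.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_cnn_gru(num_frames=16, frame_size=224, num_classes=101,
                  gru_units=128, weights=None):
    """Hypothetical CNN + GRU video classifier (a sketch, not the project's code)."""
    # Per-frame feature extractor. pooling="avg" turns each frame into one vector.
    # weights=None gives random initialization; pass "imagenet" for pretrained weights.
    backbone = tf.keras.applications.VGG19(
        include_top=False,
        weights=weights,
        pooling="avg",
        input_shape=(frame_size, frame_size, 3),
    )

    # Input is a clip: (num_frames, height, width, channels).
    frames = layers.Input(shape=(num_frames, frame_size, frame_size, 3))

    # Apply the same CNN to every frame, giving (batch, num_frames, feature_dim).
    features = layers.TimeDistributed(backbone)(frames)

    # The GRU reads the frame features in order and returns its final state.
    clip_summary = layers.GRU(gru_units)(features)

    outputs = layers.Dense(num_classes, activation="softmax")(clip_summary)
    return tf.keras.Model(frames, outputs)
```

Swapping the backbone only changes the `tf.keras.applications` constructor. Input preprocessing (for example `tf.keras.applications.vgg19.preprocess_input`) is left out for brevity and would be needed with pretrained weights.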
CNN-based sequence models:
- Feature extractors: VGG19, InceptionV3, InceptionV4-ResNet, EfficientNetB4
- Temporal model: GRU layers
ViViT:
- Transformer-based model for video classification
- Spatiotemporal attention and tubelet embedding
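Tubelet embedding splits a clip into non-overlapping spatiotemporal blocks and turns each block into one token. The sketch below shows only that splitting step in plain Python, on a single-channel clip stored as nested lists. The function name `extract_tubelets` and the tubelet size are hypothetical. In a real ViViT model each flattened tubelet would then be linearly projected, which is commonly done with a 3D convolution whose stride equals its kernel size.

```python
from itertools import product


def extract_tubelets(video, tubelet):
    """Split a clip into non-overlapping tubelets and flatten each one.

    video:   nested list indexed as video[t][y][x] (single channel for brevity)
    tubelet: (frames, height, width) of one tubelet
    Returns a list of tokens, each a flat list of frames * height * width values.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    t, h, w = tubelet
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")

    tokens = []
    # Walk over the top-left-front corner of every tubelet.
    for t0, y0, x0 in product(range(0, T, t), range(0, H, h), range(0, W, w)):
        # Gather every value inside this tubelet in (time, row, column) order.
        token = [
            video[t0 + dt][y0 + dy][x0 + dx]
            for dt, dy, dx in product(range(t), range(h), range(w))
        ]
        tokens.append(token)
    return tokens
```

For a clip of shape `(T, H, W)` and tubelets of shape `(t, h, w)`, this yields `(T/t) * (H/h) * (W/w)` tokens, which is the sequence length the transformer attends over.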
Each model is evaluated using:
- Accuracy, Precision, Recall, F1-Score
- Training/validation curves
- Confusion matrix
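As a reference for how these metrics relate to each other, the following dependency-free sketch derives accuracy, precision, recall, and F1-score from a confusion matrix. The function names are hypothetical, and macro averaging (an unweighted mean over classes) is an assumption, since the averaging scheme is not stated here.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted as class i
        actual = sum(cm[i])                          # row sum: truly class i
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": sum(cm[i][i] for i in range(n)) / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Integer class labels in the range `0..num_classes-1` are assumed. Classes with no predictions or no true samples contribute 0 to the corresponding average rather than raising an error.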
Acknowledgments:
- Dr. Lina Chato
- UCF101 dataset
- TensorFlow team
- All the cited authors