- [2015 AAAI] Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, [paper], [bibtex].
- [2018 ECCV] Find and Focus: Retrieve and Localize Video Events with Natural Language Queries, [paper], [bibtex].
- [2018 ECCV] Cross-Modal and Hierarchical Modeling of Video and Text, [paper], [bibtex], sources: [zbwglory/CMHSE].
- [2019 CVPR] Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval, [paper], [bibtex], sources: [yalesong/pvse].
- [2020 IEEE TMM] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries, [paper], [bibtex].
- [2015 ICCV] Sequence to Sequence: Video to Text, [paper], [bibtex], [homepage], sources: [vsubhashini/caffe/examples/s2vt].
- [2017 ICCV] Dense-Captioning Events in Videos, [paper], [bibtex], [homepage], sources: [ranjaykrishna/densevid_eval].
- [2017 arXiv] Multi-Task Video Captioning with Video and Entailment Generation, [paper], [bibtex].
- [2018 CVPR] Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning, [paper], [bibtex], sources: [JaywongWang/DenseVideoCaptioning].
- [2018 CVPR] End-to-End Dense Video Captioning with Masked Transformer, [paper], [bibtex], sources: [salesforce/densecap].
- [2018 CVPR] Finding It: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos, [paper], [bibtex], [supplementary], [poster], [homepage], [youtube].
- [2018 NeurIPS] Weakly Supervised Dense Event Captioning in Videos, [paper], [bibtex], sources: [XgDuan/WSDEC].
- [2019 WACV] Joint Event Detection and Description in Continuous Video Streams, [paper], [bibtex], sources: [VisionLearningGroup/JEDDi-Net].
- [2019 CVPR] Grounded Video Description, [paper], [bibtex], sources: [facebookresearch/ActivityNet-Entities], [facebookresearch/grounded-video-description].
- [2019 CSUR] Video Description: A Survey of Methods, Datasets, and Evaluation Metrics, [paper], [bibtex].
- [2019 ACL] Dense Procedure Captioning in Narrated Instructional Videos, [paper], [bibtex].
- [2019 ACL] Multimodal Abstractive Summarization for How2 Videos, [paper], [bibtex].
- [2019 EMNLP] Guiding the Flowing of Semantics: Interpretable Video Captioning via POS Tag, [paper], [bibtex].
- [2019 ICCV] Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning, [paper], [bibtex].
- [2019 ICCV] VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research, [paper], [bibtex], [homepage].
- [2020 ACL] MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, [paper], [bibtex], sources: [jayleicn/recurrent-transformer].