This repo contains the released code for the dense image captioning models described in the CVPR 2017 paper:
@InProceedings{CVPR17,
author = "Linjie Yang and Kevin Tang and Jianchao Yang and Li-Jia Li",
title = "Dense Captioning with Joint Inference and Visual Context",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)",
month = "Jul",
year = "2017"
}
All code is provided for research purposes only and without any warranty. Any commercial use requires our consent. When using the code in your research work, please cite the above paper. Our code is adapted from the popular Faster-RCNN repo written by Ross Girshick, which is based on the open source deep learning framework Caffe. The evaluation code is adapted from the COCO captioning evaluation code.
Please follow the official guide. CUDA 7.5+ and cuDNN 5.0+ are supported. Tested on Ubuntu 14.04.
cd lib
make
Download the official sample model here. This model is the Twin-LSTM with late context fusion (fused by summation) described in the paper. To test the model, run the following command in the library root folder.
python ./lib/tools/demo.py --image [IMAGE_PATH] --gpu [GPU_ID] --net [MODEL_PATH]
It will generate a folder named "demo" in the library root. Inside the "demo" folder, there will be an HTML page showing the predicted results.
For model training, you will need to download the Visual Genome dataset from the Visual Genome Website. Either version 1.0 or 1.2 is fine.
Download the pre-trained VGG16 model from link.
Modify the data paths in models/dense_cap/preprocess.py and run it from the library root to generate training/validation/testing data.
Run models/dense_cap/dense_cap_train.sh to start training. For example, to train a model with joint inference and visual context (late fusion, feature summation) on Visual Genome 1.0:
./models/dense_cap/dense_cap_train.sh [GPU_ID] visual_genome late_fusion_sum [VGG_MODEL_PATH]
Training typically takes 3 days. Note that, due to limitations of Python, multi-GPU training is not available for this library. In this library, we only provide the Twin-LSTM structure for joint inference and late fusion for context fusion, with three different fusion operators: summation, multiplication, and concatenation. Other structures described in the paper can be implemented by adapting the existing code.
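To make the three fusion operators concrete, here is a minimal pure-Python sketch of how a local (region) feature and a global (context) feature could be combined. It is an illustration only, not the Caffe layers used in this library; the function name `fuse`, the operator labels, and the toy feature values are all assumptions made for the example.

```python
def fuse(local_feat, global_feat, op="sum"):
    """Combine a local feature with a global context feature.

    Hypothetical helper for illustration; the library itself performs
    fusion inside the Caffe network definition.
    """
    if op == "concat":
        # Concatenation keeps both vectors side by side.
        return list(local_feat) + list(global_feat)
    if len(local_feat) != len(global_feat):
        raise ValueError("sum and mul need features of equal length")
    if op == "sum":
        # Element-wise summation.
        return [a + b for a, b in zip(local_feat, global_feat)]
    if op == "mul":
        # Element-wise multiplication.
        return [a * b for a, b in zip(local_feat, global_feat)]
    raise ValueError("unknown fusion operator: %s" % op)


# Toy feature vectors (made-up values).
local_feat = [1.0, 2.0, 3.0]
global_feat = [0.5, 0.5, 2.0]

fused_sum = fuse(local_feat, global_feat, "sum")
fused_mul = fuse(local_feat, global_feat, "mul")
fused_cat = fuse(local_feat, global_feat, "concat")
```

Summation and multiplication keep the feature dimension unchanged, while concatenation doubles it, so any layer consuming the fused feature has to expect a larger input in the concatenation case.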
Modify models/dense_cap/dense_cap_test.sh according to the model you want to test. For example, if you want to test the provided sample model, it will look like this:
GPU_ID=0
NET_FINAL=models/dense_cap/dense_cap_late_fusion_sum.caffemodel
TEST_IMDB="vg_1.0_test"
PT_DIR="dense_cap"
time ./lib/tools/test_net.py --gpu ${GPU_ID} \
--def_feature models/${PT_DIR}/vgg_region_global_feature.prototxt \
--def_recurrent models/${PT_DIR}/test_cap_pred_context.prototxt \
--def_embed models/${PT_DIR}/test_word_embedding.prototxt \
--net ${NET_FINAL} \
--imdb ${TEST_IMDB} \
--cfg models/${PT_DIR}/dense_cap.yml \
The sample model will get an mAP of around 9.05.
Apart from the model path (NET_FINAL), the only thing you should change is def_recurrent, which should be models/${PT_DIR}/test_cap_pred_no_context.prototxt for models without context information and models/${PT_DIR}/test_cap_pred_context.prototxt for models with context fusion.
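The choice of def_recurrent can be sketched as a small Python helper. This is only an illustration of the rule stated above; the function `recurrent_prototxt` and its parameters are hypothetical and are not part of the library.

```python
def recurrent_prototxt(pt_dir, use_context):
    """Return the def_recurrent path for a model with or without context.

    Hypothetical helper; the file names are the two mentioned in this README.
    """
    if use_context:
        name = "test_cap_pred_context.prototxt"
    else:
        name = "test_cap_pred_no_context.prototxt"
    return "models/%s/%s" % (pt_dir, name)


with_context = recurrent_prototxt("dense_cap", use_context=True)
without_context = recurrent_prototxt("dense_cap", use_context=False)
```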
If you want to test late fusion models with other fusion operators, you need to modify test_cap_pred_context.prototxt. Change the "local_global_fusion" layer to element-wise multiplication or concatenation accordingly.
To visualize the results, you can add --vis to the end of the above script. It will generate an HTML page for each image, visualizing the results, under the folder output/dense_cap/${TEST_IMDB}/vis.
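For readers who prefer to drive the test from Python, the following sketch assembles the same argument list as the shell script, with --vis appended on request, and computes the folder where the HTML pages are expected. The helpers `build_test_command` and `vis_dir` are hypothetical; only the script path, flags, and file names come from this README.

```python
def build_test_command(gpu_id, net_final, test_imdb, pt_dir="dense_cap", vis=False):
    """Assemble the test_net.py argument list (hypothetical helper)."""
    cmd = [
        "./lib/tools/test_net.py",
        "--gpu", str(gpu_id),
        "--def_feature", "models/%s/vgg_region_global_feature.prototxt" % pt_dir,
        "--def_recurrent", "models/%s/test_cap_pred_context.prototxt" % pt_dir,
        "--def_embed", "models/%s/test_word_embedding.prototxt" % pt_dir,
        "--net", net_final,
        "--imdb", test_imdb,
        "--cfg", "models/%s/dense_cap.yml" % pt_dir,
    ]
    if vis:
        # Ask the test script to also write HTML visualizations.
        cmd.append("--vis")
    return cmd


def vis_dir(test_imdb):
    """Folder where the per-image HTML pages are written, per this README."""
    return "output/dense_cap/%s/vis" % test_imdb


cmd = build_test_command(
    0,
    "models/dense_cap/dense_cap_late_fusion_sum.caffemodel",
    "vg_1.0_test",
    vis=True,
)
out_dir = vis_dir("vg_1.0_test")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the library root, assuming the environment is already set up as described above.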
If you have any questions regarding the repo, please send email to Linjie Yang ([email protected]).