This repository is the PyTorch implementation of the paper:
Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)
Shweta Mahajan and Stefan Roth
We additionally include evaluation code from Luo et al. in the folder GoogleConceptualCaptioning, which has been patched for compatibility.
The code is written for Python 3.6.10 and CUDA 9.0.
Requirements:
- torch 1.1.0
- torchvision 0.3.0
- nltk 3.5
- inflect 4.1.0
- tqdm 4.46.0
- sklearn 0.0
- h5py 2.10.0
To install requirements:
conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>
The dataset used in this project for assessing accuracy and diversity is COCO 2014 (m-RNN split). The full dataset is available here.
We use Faster R-CNN features for images, similar to Anderson et al. We additionally require the "classes" and "scores" fields detected for the image regions. The classes correspond to Visual Genome.
Preprocessed training data is available here as hdf5 files. The provided hdf5 files contain the following fields:
- image_id: ID of the COCO image
- num_boxes: Number of proposal regions detected by Faster R-CNN
- features: ResNet-101 features of the extracted regions
- classes: Visual Genome classes of the extracted regions
- scores: Scores of the Visual Genome classes of the extracted regions
Note that the ["image_id","num_boxes","features"] fields are identical to Anderson et al.
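To check that a downloaded file contains the fields listed above, the following minimal sketch walks an hdf5 file and reports every dataset it finds. The internal layout of the files (for example, whether the fields are top-level datasets or grouped per image) is not specified here, so the sketch makes no assumption about it and simply recurses through whatever groups exist. The helper names describe and describe_file are hypothetical and not part of this repository.

```python
def describe(node, prefix=""):
    """Recursively list (path, shape, dtype) for every dataset under `node`.

    `node` is expected to behave like an h5py.File or h5py.Group: a mapping
    whose values are either datasets (which have a .shape) or nested groups.
    """
    rows = []
    for name, item in node.items():
        path = f"{prefix}/{name}"
        if hasattr(item, "shape"):
            # Dataset-like leaf: record its shape and dtype.
            rows.append((path, tuple(item.shape), str(item.dtype)))
        else:
            # Group-like node: descend into it.
            rows.extend(describe(item, path))
    return rows


def describe_file(h5_path):
    """Open an hdf5 file read-only and describe all datasets in it."""
    import h5py  # imported here so describe() itself has no h5py dependency

    with h5py.File(h5_path, "r") as f:
        return describe(f)
```

Calling describe_file("coco/coco_val_2014_adaptive_withclasses.h5") and printing the returned tuples should show where image_id, num_boxes, features, classes and scores are stored in the file.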
Create a folder named coco and download the preprocessed training and test datasets from the coco folder in the link above as follows (it is also possible to directly download the entire coco folder from the link):
- Download the following files for training on COCO 2014 (m-RNN split):
coco/coco_train_2014_adaptive_withclasses.h5
coco/coco_val_2014_adaptive_withclasses.h5
coco/coco_val_mRNN.txt
coco/coco_test_mRNN.txt
- Download the following files for training on held-out COCO (novel object captioning):
coco/coco_train_2014_noc_adaptive_withclasses.h5
coco/coco_train_extra_2014_noc_adaptive_withclasses.h5
- Download the following files for testing on held-out COCO (novel object captioning):
coco/coco_test_2014_noc_adaptive_withclasses.h5
- Download the (caption) annotation files and place them in a subdirectory coco/annotations (mirroring the Google Drive folder structure):
coco/annotations/captions_train2014.json
coco/annotations/captions_val2014.json
- Download the following files from here into a separate folder data (outside coco). These files contain the contextual neighbours for pseudo supervision:
data/nn_final.pkl
data/nn_noc.pkl
For running the train/test scripts (described in the following), "pathToData" and "nn_dict_path" in params.json and params_noc.json need to be set to the coco and data folders created above, respectively.
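As an alternative to editing the files by hand, the two entries can be set programmatically. This is a minimal sketch that assumes params.json and params_noc.json are JSON objects with "pathToData" and "nn_dict_path" as top-level keys; if the actual files nest these keys differently, the lookup has to be adapted. The function name set_data_paths is hypothetical.

```python
import json
from pathlib import Path


def set_data_paths(params_file, path_to_data, nn_dict_path):
    """Set the two path entries in a params JSON file and return the new dict."""
    params_file = Path(params_file)
    params = json.loads(params_file.read_text())
    # Assumption: both keys live at the top level of the JSON object.
    params["pathToData"] = str(path_to_data)
    params["nn_dict_path"] = str(nn_dict_path)
    # All other entries (hyperparameters etc.) are written back unchanged.
    params_file.write_text(json.dumps(params, indent=2) + "\n")
    return params
```

For example, set_data_paths("params.json", "./coco", "./data") would point both entries at folders in the current directory; the same call can be repeated for params_noc.json.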
The folder structure after data download should be as follows:
coco
- annotations
  - captions_train2014.json
  - captions_val2014.json
- coco_val_mRNN.txt
- coco_test_mRNN.txt
- coco_train_2014_adaptive_withclasses.h5
- coco_val_2014_adaptive_withclasses.h5
- coco_train_2014_noc_adaptive_withclasses.h5
- coco_train_extra_2014_noc_adaptive_withclasses.h5
- coco_test_2014_noc_adaptive_withclasses.h5
data
- coco_classname.txt
- visual_genome_classes.txt
- vocab_coco_full.pkl
- nn_final.pkl
- nn_noc.pkl
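Before starting a training run, the layout above can be verified with a short script. The sketch below assumes that the coco and data folders sit side by side under a common root directory (the instructions only state that data lies outside coco); the names EXPECTED and missing_files are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

# Files expected after download, relative to the directory containing coco/ and data/.
EXPECTED = [
    "coco/annotations/captions_train2014.json",
    "coco/annotations/captions_val2014.json",
    "coco/coco_val_mRNN.txt",
    "coco/coco_test_mRNN.txt",
    "coco/coco_train_2014_adaptive_withclasses.h5",
    "coco/coco_val_2014_adaptive_withclasses.h5",
    "coco/coco_train_2014_noc_adaptive_withclasses.h5",
    "coco/coco_train_extra_2014_noc_adaptive_withclasses.h5",
    "coco/coco_test_2014_noc_adaptive_withclasses.h5",
    "data/coco_classname.txt",
    "data/visual_genome_classes.txt",
    "data/vocab_coco_full.pkl",
    "data/nn_final.pkl",
    "data/nn_noc.pkl",
]


def missing_files(root="."):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

An empty list returned by missing_files(".") indicates that all files are in place; otherwise the list names the files that still need to be downloaded.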
Please follow these instructions for training:
- Set hyperparameters for training in params.json and params_noc.json.
- Train a model on COCO 2014 for captioning,
python ./scripts/train.py
- Train a model for diverse novel object captioning,
python ./scripts/train_noc.py
Please note that the data folder provides the required vocabulary.
The models were trained on a single NVIDIA V100 GPU with 32 GB memory. 16 GB is sufficient for training a single run.
We provide pre-trained models for both captioning on COCO 2014 (m-RNN split) and novel object captioning. Please follow these steps:
- Download the pre-trained models from here to the ckpts folder.
- For evaluation of oracle scores and diversity, we follow Luo et al. In the folder GoogleConceptualCaptioning, download the cider and, in the cococaption folder, run the download scripts,
./GoogleConceptualCaptioning/cococaption/get_google_word2vec_model.sh
./GoogleConceptualCaptioning/cococaption/get_stanford_models.sh
python ./scripts/eval.py
- For diversity evaluation create the required numpy file for consensus re-ranking using,
python ./scripts/eval_diversity.py
For consensus re-ranking, follow the steps here. To obtain the final diversity scores, follow the instructions of DiversityMetrics. Convert the numpy file to the required JSON format and run the script evalscripts.py.
- To evaluate the F1 score for novel object captioning,
python ./scripts/eval_noc.py
| | B4 | B3 | B2 | B1 | CIDEr | METEOR | ROUGE | SPICE |
|---|---|---|---|---|---|---|---|---|
| COS-CVAE | 0.633 | 0.739 | 0.842 | 0.942 | 1.893 | 0.450 | 0.770 | 0.339 |

| | Unique | Novel | mBLEU | Div-1 | Div-2 |
|---|---|---|---|---|---|
| COS-CVAE | 96.3 | 4404 | 0.53 | 0.39 | 0.57 |

| | bottle | bus | couch | microwave | pizza | racket | suitcase | zebra | average |
|---|---|---|---|---|---|---|---|---|---|
| COS-CVAE | 35.4 | 83.6 | 53.8 | 63.2 | 86.7 | 69.5 | 46.1 | 81.7 | 65.0 |
@inproceedings{coscvae20neurips,
title = {Diverse Image Captioning with Context-Object Split Latent Spaces},
author = {Mahajan, Shweta and Roth, Stefan},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2020}
}