This project applies Computer Vision methods to generate image captions.
- It uses transfer learning on a Convolutional Neural Network (CNN), which serves as the encoder for an LSTM-based RNN decoder.
- Instead of generating class scores for an image, the CNN has been modified to output only feature maps, by removing the final linear layer used for prediction.
- An LSTM-based RNN decoder then takes captions from the training data together with the CNN's feature map to learn caption generation.
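The encoder/decoder split above can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the project's exact code: the class names and layer sizes are illustrative assumptions, and a tiny stand-in CNN replaces the pretrained backbone the project actually reuses via transfer learning.

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Stand-in for a pretrained CNN backbone: the final linear (class-score)
    layer is simply never added, so the network outputs pooled features."""
    def __init__(self, embed_size):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # (B, 16, 1, 1)
        )
        self.embed = nn.Linear(16, embed_size)       # project features to embed size

    def forward(self, images):
        features = self.backbone(images).flatten(1)  # (B, 16)
        return self.embed(features)                  # (B, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder: the image feature acts as the first input step,
    followed by the embedded caption tokens."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.word_embed(captions[:, :-1])          # drop last token
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                                  # (B, T, vocab_size)
```

During training, the per-step vocabulary scores are compared against the ground-truth caption tokens with a cross-entropy loss; at inference the decoder is instead run step by step, feeding each predicted word back in.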
The notebook `Automatic Image Captioning using Encoder CNN & Decoder RNN.ipynb` contains all the code for training and prediction. Node.js is used to create an API for serving the model to web apps. A single-page web app is located in `Node.js_Server`.
- The application takes an image and displays an appropriate caption for it.
| "a laptop computer sitting on top of a desk" | "a plate of food with a fork and fork" |
|---|---|
Detailed data visualization, training, and inference notebooks are as follows:
- Notebook 1: Testing the COCO dataset API and visualizing sample images
- Notebook 2: Implementing and testing data loaders and tokenizers
- Notebook 3: Training the encoder–decoder model
- Notebook 4: Testing the trained model on the test dataset and input images
To run the notebooks properly, move them to the project's root directory.
Install the required Python packages from the requirements file using:
pip install -r requirements.txt
- The training, testing, and validation datasets total over 24 GB, hence they need to be downloaded from the source: https://cocodataset.org/#download
  a. Training dataset: http://images.cocodataset.org/zips/train2014.zip
  b. Testing dataset: http://images.cocodataset.org/zips/test2014.zip
  c. Validation dataset: http://images.cocodataset.org/zips/val2014.zip
  d. Annotations: http://images.cocodataset.org/annotations/image_info_test2014.zip and http://images.cocodataset.org/annotations/annotations_trainval2014.zip
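Fetching and unpacking the archives above can be scripted. The helper below is a hypothetical sketch (the function name and the `Data` default directory are assumptions, not part of the project), and the real downloads are tens of gigabytes, so expect long transfer times.

```python
import os
import zipfile
import urllib.request

def download_and_extract(url, dest_dir="Data"):
    """Download a COCO archive (if not already present) and extract it
    into the destination directory."""
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)  # large files: this can take hours
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
```

For example, `download_and_extract("http://images.cocodataset.org/zips/val2014.zip")` would place the validation images under `Data/val2014`.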
- The trained model's checkpoint (only up to 2 epochs) is located in `model_checkpoints` as well as in `Node.js_Server/python_models/saved_models`.
To set up, download the datasets and annotations and extract everything into the `Data` directory.
With batch size = 32 and the model as defined in the training notebooks:
- It takes approximately 2 hours 30 minutes to run a single epoch on the following hardware, available on Google Colab:
  Intel(R) Xeon(R) CPU @ 2.20GHz [Core(s) per socket: 1 | Thread(s) per core: 2]
  Tesla T4 [CUDA Version: 10.1]
- It takes approximately 8 hours (back-calculated from the time taken for 100 steps) to run a single epoch on the following local hardware:
  Intel(R) Core(TM) i3-2120 CPU @ 3.20GHz [Core(s): 2 | Thread(s) per core: 2]
  GTX 1060 [CUDA Version: 10.1]
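The back-calculation mentioned above, extrapolating epoch time from 100 timed steps, is simple arithmetic; a hypothetical helper (the function name and parameters are assumptions made for illustration):

```python
import math

def estimate_epoch_hours(seconds_per_100_steps, num_samples, batch_size):
    """Extrapolate the time for one full epoch from a timed run of 100 steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # batches per epoch
    seconds_per_step = seconds_per_100_steps / 100.0
    return steps_per_epoch * seconds_per_step / 3600.0
```

For reference, COCO train2014 contains 82,783 images, so batch size 32 gives about 2,587 steps per epoch; the quoted ~8-hour epoch then corresponds to roughly 11 seconds per step.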
The Node.js application works by spawning a Python child process to generate captions for images.
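The Python side of such a child process would read an image path from stdin and write the caption back on stdout for Node.js to capture. The sketch below is a hypothetical illustration of that protocol only; `generate_caption` is a placeholder, and the project's actual model loading and inference live in `Node.js_Server/python_models/model.py`.

```python
import sys

def generate_caption(image_path):
    # Placeholder: the real implementation loads the encoder/decoder
    # checkpoints and runs inference on the image.
    return "a placeholder caption for " + image_path

def main():
    # Node.js writes one image path per line to stdin; each caption is
    # printed (and flushed) so the parent process receives it immediately.
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        print(generate_caption(path), flush=True)

if __name__ == "__main__":
    main()
```

Flushing stdout after each line matters here: without it, the caption may sit in the Python process's output buffer and never reach the Node.js parent.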
- Navigate to `Node.js_Server`.
- Place the trained PyTorch model's checkpoint file `checkpoint.pth` into the `python_models/saved_models` folder (in case a new model was trained).
- In case a new `vocab.pkl` file was generated, or the new model's checkpoint filename is different, change the following variables in `Node.js_Server/python_models/model.py` to the appropriate names as required:

```python
ENCODER_CNN_CHECKPOINT = "python_models/saved_models/encoderEpoch_2.pth"
DECODER_LSTM_RNN_CHECKPOINT = "python_models/saved_models/decoderEpoch_2.pth"
VOCAB_FILE = "python_models/saved_models/vocab.pkl"
```
- Run the following commands:

```
npm install                       # installs Node.js dependencies
pip install -r requirements.txt   # if Python packages haven't already been installed from the project root
npm start
```
- To test it on other devices on the local network, run:

```
node app.js <your_ip>:8000
```

Example: `node app.js 192.168.32.134:8000`