In this project, a combination of Transformer models (a ViT encoder and a T5 decoder) was used to perform the image captioning task. A qualitative evaluation on a previously unseen dataset was carried out to check whether the model generalizes well; in this case, it was evaluated in the area of vehicle navigation with the Cityscapes dataset [4]. The details are described below:
- ⚙️ Architecture: encoder-decoder (a minimal construction sketch follows this list)
  - Encoder: ViT (Hugging Face link; paper [1])
  - Decoder: T5 decoder (Hugging Face link; paper [2])
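A minimal sketch of one way the ViT encoder can be bridged to the T5 decoder with the Hugging Face transformers library. The checkpoint names and the bridging approach (feeding ViT patch embeddings to T5 as pre-computed encoder_outputs) are assumptions for illustration, not necessarily the exact wiring used in the notebooks.

```python
# Sketch only: assumed checkpoints and an assumed ViT -> T5 bridge.
from transformers import ViTModel, T5ForConditionalGeneration, T5TokenizerFast
from transformers.modeling_outputs import BaseModelOutput

# Assumed checkpoints; t5-base has d_model=768, matching the ViT-Base hidden size.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

def caption_loss(pixel_values, captions):
    """Cross-entropy captioning loss: ViT encodes the image and the T5 decoder
    consumes the patch embeddings as if they were T5 encoder outputs."""
    vision_out = vit(pixel_values=pixel_values)
    enc = BaseModelOutput(last_hidden_state=vision_out.last_hidden_state)
    labels = tokenizer(captions, padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    out = t5(encoder_outputs=enc, labels=labels)  # T5 skips its own text encoder
    return out.loss
```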
- 🖼️ Datasets:
  - COCO [3]: training and quantitative evaluation
  - Cityscapes [4]: zero-shot qualitative evaluation
- 🏋️♀️ Training (a minimal training-loop sketch follows this list):
  - 40 epochs
  - GPU: Tesla P100-PCIE-16GB
  - DataLoader with num_workers=3 and batch_size=20
  - Loss function: cross-entropy (Hugging Face)
  - Optimizer: Adam with a learning rate of 1e-5
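A minimal training-loop sketch matching the settings above (40 epochs, batch_size=20, num_workers=3, Adam at 1e-5). The train_dataset object and the reuse of caption_loss from the architecture sketch are assumptions; the real notebooks may structure this differently.

```python
# Sketch only: a train_dataset yielding (pixel_values, captions) batches is hypothetical.
from torch.utils.data import DataLoader
from torch.optim import Adam

loader = DataLoader(train_dataset, batch_size=20, num_workers=3, shuffle=True)
optimizer = Adam(list(vit.parameters()) + list(t5.parameters()), lr=1e-5)

for epoch in range(40):
    for pixel_values, captions in loader:
        optimizer.zero_grad()
        loss = caption_loss(pixel_values, captions)  # cross-entropy returned by the Hugging Face model
        loss.backward()
        optimizer.step()
```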
- 📈 Experiment tracking: Weights & Biases
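A minimal logging sketch for the Weights & Biases tracking; the project name and logged fields are hypothetical.

```python
# Sketch only: hypothetical project name and metric names.
import wandb

wandb.init(project="vit-t5-image-captioning",
           config={"epochs": 40, "batch_size": 20, "lr": 1e-5})
# Inside the training loop:
wandb.log({"train/loss": loss.item(), "epoch": epoch})
```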
(Figure: composed with images from Hugging Face and from papers [1] and [2].)
All code is organized into categorized notebooks, each one named according to which variables or functions need to be imported. Google Colab Pro was used in this project due to the large size of the models.
- run00_dataset.ipynb: installs and imports libraries, imports the dataset, and declares configuration variables and other functions.
- run01_metrics.ipynb: declares the function used to calculate metrics.
- run02_models.ipynb: defines the model.
- run03_training_exp008.ipynb: contains the full training and evaluation loop over the epochs.
- run04_evaluation_exp008.ipynb: quantitatively evaluates all dataset splits (including the original validation split), as well as a sub-dataset with filtered categories, using the COCOEvalCap tool (a usage sketch follows this list). The filtered categories were chosen to retrieve images similar to those in the Cityscapes dataset.
- run05_cityscapes.ipynb: zero-shot evaluation of the trained model on the Cityscapes dataset.
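A short usage sketch of the COCOEvalCap evaluation referenced in run04_evaluation_exp008.ipynb; the annotation and results file paths are placeholders, assuming the predictions have been written to a COCO-style results JSON.

```python
# Sketch only: placeholder paths; predictions must be a JSON list of
# {"image_id": ..., "caption": ...} entries.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/captions_val.json")         # ground-truth captions
coco_res = coco.loadRes("generated_captions.json")   # model predictions
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only the predicted images
coco_eval.evaluate()
for metric, score in coco_eval.eval.items():          # BLEU, METEOR, ROUGE_L, CIDEr, SPICE
    print(f"{metric}: {score:.3f}")
```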
[1] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[2] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), 1-67.
[3] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.
[4] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., ... & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3213-3223).