Abstract

Large-scale transformer models are costly to fine-tune on new tasks: because of their extensive parameter count, they demand substantial compute, time, and data. To address this, zero-shot and few-shot learning approaches, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving open the question of their effectiveness on multi-modal tasks such as image captioning. This paper explores the effectiveness of prompts and parameter-efficient modules in reducing the training effort for image captioning. Rather than fine-tuning the full model, we trained only the prompt and parameter-efficient modules on the pre-trained Oscar transformer model using the COCO dataset. We tested five prompt tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training, with only a minimal increase in trainable parameters (5.7%). Our work paves the way towards using single-stream transformer models for a variety of fine-tuned tasks, with a potentially large reduction in retraining time and processing resources.
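The abstract describes the general recipe of freezing a pre-trained single-stream transformer and training only small added components. Below is a minimal PyTorch sketch of that recipe, combining learnable visual prompt tokens (VPT), LoRA low-rank updates, and bottleneck Adapters; all class names, dimensions, and hyperparameters are illustrative assumptions, not the paper's or Oscar's actual implementation.

```python
# Minimal sketch (assumptions, not the paper's code): freeze a pre-trained
# backbone and train only (i) prompt tokens, (ii) LoRA updates, (iii) Adapters.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity) update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


class PromptedBackbone(nn.Module):
    """Wraps a frozen encoder and prepends trainable prompt tokens (VPT-style)."""
    def __init__(self, encoder: nn.Module, dim: int, num_prompts: int = 10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, token_embeds):          # token_embeds: (batch, seq, dim)
        prompts = self.prompts.expand(token_embeds.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))
```

In such a setup, only the parameters with `requires_grad=True` (prompt tokens, LoRA matrices, adapter bottlenecks) are passed to the optimizer, which is what keeps the increase in trainable parameters small relative to the frozen backbone.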

Original language: English
Title of host publication: Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Pages: 501-508
Number of pages: 8
Volume: 2
Publication status: Published - 2024
Event: 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2024 - Rome, Italy
Duration: 27 Feb 2024 - 29 Feb 2024

Publication series

Name: Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Publisher: Science and Technology Publications, Lda
ISSN (Print): 2184-5921

Conference

Conference: 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2024
Country/Territory: Italy
City: Rome
Period: 27/02/24 - 29/02/24

Keywords

  • Image Captioning
  • Parameter-Efficient Tuning
  • Prompts
  • Vision-Language Transformer
