Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers

Research output: Contribution to journal › Conference article › peer-review

Abstract

Large-scale transformer models are costly to fine-tune on new tasks, requiring substantial compute, time, and data, largely because of their extensive parameter count. To address this, zero-shot and few-shot learning, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving their effectiveness on multi-modal tasks such as image captioning largely unexplored. This paper examines how well prompts and parameter-efficient modules reduce the training effort for image captioning. Rather than fine-tuning the full model, we trained only the prompt and parameter-efficient modules on top of the pretrained Oscar transformer using the COCO dataset. We tested five prompt-tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training, with only a minimal increase in trainable parameters (5.7%). Our work paves the way towards adapting single-stream transformer models to a variety of downstream tasks with a substantial reduction in retraining time and processing resources.
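To make the parameter-efficient setup concrete, the sketch below freezes a generic single-stream encoder and trains only VPT-style prompt tokens plus a LoRA-style low-rank update on a linear layer. This is a minimal illustrative sketch in PyTorch; the class names, dimensions, and the toy backbone are assumptions for exposition and do not reproduce the Oscar model or the paper's implementation.

```python
# Minimal sketch (not the paper's code): freeze a pretrained encoder and train
# only (i) learnable visual-prompt tokens prepended to the input sequence and
# (ii) a LoRA-style low-rank update on a linear projection.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (A @ B) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


class PromptedEncoder(nn.Module):
    """Wraps a frozen encoder; only VPT-style prompt tokens are trained."""

    def __init__(self, encoder: nn.Module, hidden: int = 768, n_prompts: int = 10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze the pretrained backbone
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, hidden) * 0.02)

    def forward(self, embeddings):           # embeddings: (B, L, hidden)
        prompts = self.prompts.expand(embeddings.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, embeddings], dim=1))


if __name__ == "__main__":
    # Toy stand-in for a single-stream transformer backbone.
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    model = PromptedEncoder(backbone)

    # Only the prompt (and any LoRA/Adapter) parameters remain trainable.
    trainable = [p for p in model.parameters() if p.requires_grad]
    print(sum(p.numel() for p in trainable), "trainable parameters")
```

In this kind of setup, the optimizer is given only the parameters that still require gradients, so the pretrained weights are untouched and the trainable-parameter count stays small relative to the full model.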

Keywords

  • Image Captioning
  • Parameter-Efficient Tuning
  • Prompts
  • Vision-Language Transformer

