TY - GEN
T1 - Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers
AU - Liu, Xuehao
AU - Delany, Sarah Jane
AU - McKeever, Susan
N1 - Publisher Copyright:
© 2024 by SCITEPRESS – Science and Technology Publications, Lda.
PY - 2024
Y1 - 2024
N2 - Large-scale transformer models pose challenges due to the resource-intensive training, time, and data required for fine-tuning on new tasks, mainly because of their extensive parameter count. To address this, zero-shot and few-shot learning, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving a gap regarding their effectiveness in multi-modal tasks such as image captioning. This paper explores the effectiveness of prompts and parameter-efficient modules in reducing the training effort for image captioning. Rather than extensive fine-tuning, we trained only the prompt and parameter-efficient modules on the pretrained Oscar transformer model using the COCO dataset. We tested five prompt tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training, with a minimal increase in trainable parameters (5.7%). Our work paves the way towards using single-stream transformer models for a variety of fine-tuned tasks, with a potentially large reduction in retraining time and processing resources.
AB - Large-scale transformer models pose challenges due to the resource-intensive training, time, and data required for fine-tuning on new tasks, mainly because of their extensive parameter count. To address this, zero-shot and few-shot learning, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving a gap regarding their effectiveness in multi-modal tasks such as image captioning. This paper explores the effectiveness of prompts and parameter-efficient modules in reducing the training effort for image captioning. Rather than extensive fine-tuning, we trained only the prompt and parameter-efficient modules on the pretrained Oscar transformer model using the COCO dataset. We tested five prompt tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training, with a minimal increase in trainable parameters (5.7%). Our work paves the way towards using single-stream transformer models for a variety of fine-tuned tasks, with a potentially large reduction in retraining time and processing resources.
KW - Image Captioning
KW - Parameter-Efficient Tuning
KW - Prompts
KW - Vision-Language Transformer
UR - http://www.scopus.com/inward/record.url?scp=85192147737&partnerID=8YFLogxK
U2 - 10.5220/0012364800003660
DO - 10.5220/0012364800003660
M3 - Conference contribution
AN - SCOPUS:85192147737
VL - 2
T3 - Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
SP - 501
EP - 508
BT - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
T2 - 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2024
Y2 - 27 February 2024 through 29 February 2024
ER -