Learning to Prompt for Vision-Language Models

International journal of computer vision, 2022-09, Vol.130 (9), p.2337-2348 [Peer Reviewed Journal]

The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ISSN: 0920-5691; EISSN: 1573-1405; DOI: 10.1007/s11263-022-01653-1

Full text available

  • Title: Learning to Prompt for Vision-Language Models
  • Author: Zhou, Kaiyang ; Yang, Jingkang ; Loy, Chen Change ; Liu, Ziwei
  • Subjects: Artificial Intelligence ; Computational linguistics ; Computer Imaging ; Computer Science ; Context ; Domains ; Image Processing and Computer Vision ; Language ; Language processing ; Learning ; Natural language ; Natural language interfaces ; Natural language processing ; Object recognition ; Optimization ; Pattern Recognition ; Pattern Recognition and Graphics ; Representations ; Vision
  • Is Part Of: International journal of computer vision, 2022-09, Vol.130 (9), p.2337-2348
  • Description: Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the entire set of pre-trained parameters is kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and it gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts. (A minimal code sketch of the CoOp idea appears after the record details below.)
  • Publisher: New York: Springer US
  • Language: English
  • Identifier: ISSN: 0920-5691
    EISSN: 1573-1405
    DOI: 10.1007/s11263-022-01653-1
  • Source: ProQuest Central
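
The abstract above describes the core of CoOp: hand-written prompt words are replaced by learnable context vectors, and only those vectors are optimized while the pre-trained image and text encoders stay frozen. The Python sketch below is purely illustrative and rests on several assumptions: PromptLearner, coop_logits, the stand-in frozen text encoder, and all dimensions and hyperparameters are hypothetical placeholders, not the authors' implementation, which plugs the learned context into CLIP's actual token-embedding sequence.

```python
# Minimal, illustrative sketch of the CoOp idea (unified context): learnable
# context vectors replace the hand-written prompt words, and only those
# vectors are trained while the pre-trained encoders stay frozen.
# The encoders below are hypothetical stand-ins, not the actual CLIP modules.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learnable context vectors shared by all classes (unified context)."""

    def __init__(self, n_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # [CTX_1] ... [CTX_M] are free parameters, initialized like word embeddings.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Fixed (non-learned) embeddings standing in for each class-name token.
        self.register_buffer("class_emb", torch.randn(n_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        # Prompt for class c = [CTX_1, ..., CTX_M, CLASSNAME_c]
        ctx = self.ctx.unsqueeze(0).expand(self.class_emb.size(0), -1, -1)
        return torch.cat([ctx, self.class_emb], dim=1)  # (n_classes, n_ctx+1, dim)


def coop_logits(image_feat, prompts, text_encoder, logit_scale=100.0):
    """Cosine-similarity logits between image features and prompt-derived text features."""
    text_feat = text_encoder(prompts)                 # (n_classes, dim)
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    return logit_scale * image_feat @ text_feat.t()   # (batch, n_classes)


if __name__ == "__main__":
    n_classes, dim = 10, 512
    prompt_learner = PromptLearner(n_classes, n_ctx=16, dim=dim)

    # Stand-in for the frozen text encoder: mean-pool token embeddings, then project.
    frozen_proj = nn.Linear(dim, dim)
    for p in frozen_proj.parameters():
        p.requires_grad_(False)
    text_encoder = lambda prompts: frozen_proj(prompts.mean(dim=1))

    # Only the context vectors receive gradients, mirroring CoOp's few-shot setup.
    optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=0.002)

    image_feat = torch.randn(8, dim)          # pretend frozen image-encoder output
    labels = torch.randint(0, n_classes, (8,))
    logits = coop_logits(image_feat, prompt_learner(), text_encoder)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    print("loss:", float(loss))
```

The class-specific variant mentioned in the abstract would allocate a separate set of context vectors for each class instead of sharing one set across all classes; everything else in the sketch would stay the same.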
