
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yezhou Yang
ReLER Lab, AAII, University of Technology Sydney; ReLER Lab, CCAI, Zhejiang University; Stanford University; Baidu Inc.
International Conference on Learning Representations (2023)
MM RL

📝 Paper Summary

Vision-Language Models (VLMs), Test-Time Adaptation (TTA)
RLCF improves zero-shot performance of vision-language models at test time by using CLIP as a reward model to reinforce generated outputs via policy gradient, without needing ground-truth labels.
Core Problem
Existing test-time adaptation methods for zero-shot VLMs rely on entropy minimization. Under distribution shift, this often makes the model blindly confident in predictions that are already incorrect.
Why it matters:
  • Large domain gaps between pre-training and testing data hinder zero-shot transferability in real-world applications.
  • Entropy minimization (making the model more confident) is a double-edged sword: it sharpens predictions that are already correct, but it also locks the model into wrong answers.
  • Current methods often lack a mechanism to rectify incorrect outputs or provide external guidance when ground truth is absent.
Concrete Example: In a classification task, if a model incorrectly predicts 'horse' for an image of a dog, entropy minimization will simply make it more confident in 'horse'. In contrast, RLCF uses CLIP feedback to identify that the image looks more like 'dog', correcting the prediction.
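The failure mode in this example can be reproduced with a toy calculation. The sketch below is an illustration only, not the paper's code: the three class names, the starting logits, the learning rate, and the step count are all assumed values, and the "model" is just a vector of logits updated by gradient descent on its own prediction entropy.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_min_step(logits, lr=0.5):
    # One gradient-descent step on H(softmax(z)).
    # Analytic gradient: dH/dz_i = -p_i * (log p_i + H).
    probs = softmax(logits)
    h = entropy(probs)
    grads = [-p * (math.log(p) + h) for p in probs]
    return [z - lr * g for z, g in zip(logits, grads)]

def run_entropy_min(logits, steps=20, lr=0.5):
    # Repeat the entropy step and return the final probabilities.
    for _ in range(steps):
        logits = entropy_min_step(logits, lr)
    return softmax(logits)

# Hypothetical setup: the image shows a dog, but the model prefers 'horse'.
classes = ["horse", "dog", "cat"]
start_logits = [2.0, 1.5, 0.1]

before = softmax(start_logits)
after = run_entropy_min(start_logits)

for name, b, a in zip(classes, before, after):
    print(f"{name}: {b:.3f} -> {a:.3f}")
```

Because the entropy objective never looks at the image again, it can only push probability toward whichever class is already ahead, so the wrong 'horse' prediction gains confidence.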
Key Novelty
Reinforcement Learning with CLIP Feedback (RLCF)
  • Treats test-time adaptation as a reinforcement learning problem where the VLM is the policy and CLIP is the reward model.
  • Uses the similarity score from a frozen CLIP model (CLIPScore) to evaluate candidate outputs generated by the VLM for a single test sample.
  • Updates the VLM's parameters (or prompt tokens) to maximize this reward, effectively aligning the model's output with CLIP's robust embedding space during inference.
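The three points above can be sketched as a small policy-gradient loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the policy is a bare logit vector instead of a VLM, the CLIP scores are hard-coded hypothetical numbers instead of outputs of a frozen CLIP model, candidates are taken as the top-K classes, and the mean score over those candidates is used as a baseline. The function names `reinforce_step` and `adapt` are introduced here for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, clip_scores, k=2, lr=1.0):
    # Policy: softmax over class logits.
    probs = softmax(logits)
    # Candidates: the K classes the policy currently rates highest.
    top_k = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    # Reward: (assumed) CLIP score minus the mean score of the candidates.
    baseline = sum(clip_scores[i] for i in top_k) / k
    rewards = {i: clip_scores[i] - baseline for i in top_k}
    reward_sum = sum(rewards.values())
    # Loss = -sum_k r_k * log p_k, so dLoss/dz_i = p_i * sum_k r_k - r_i,
    # where r_i is zero for classes outside the candidate set.
    new_logits = []
    for i, z in enumerate(logits):
        grad = probs[i] * reward_sum - rewards.get(i, 0.0)
        new_logits.append(z - lr * grad)
    return new_logits

def adapt(logits, clip_scores, steps=10, k=2, lr=1.0):
    # Run a few reward-driven updates for a single test sample.
    for _ in range(steps):
        logits = reinforce_step(logits, clip_scores, k=k, lr=lr)
    return logits

# Hypothetical setup: the policy prefers 'horse', the reward model prefers 'dog'.
classes = ["horse", "dog", "cat"]
start_logits = [2.0, 1.5, 0.1]
clip_scores = [0.21, 0.30, 0.15]  # assumed image-text similarities

final_logits = adapt(start_logits, clip_scores)
final_probs = softmax(final_logits)
prediction = classes[max(range(len(classes)), key=lambda i: final_probs[i])]
print(prediction, [round(p, 3) for p in final_probs])
```

Subtracting the baseline means candidates scored below average are pushed down and those above average are pushed up, which is what lets the reward overturn an initially confident but wrong prediction.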
Evaluation Highlights
  • Outperforms TPT (Test-Time Prompt Tuning) by 5.4% on average across 15 datasets for zero-shot image classification using ResNet-50.
  • Achieves state-of-the-art results on ImageNet-A (OOD dataset), improving top-1 accuracy by 2.3% over TPT using ViT-B/16.
  • Improves zero-shot image captioning on Flickr30k by 2.2 CIDEr points using CapDec.
Breakthrough Assessment
7/10
Proposes a universal framework applicable to multiple VLM tasks (classification, retrieval, captioning) with consistent gains. The use of CLIP as a reward signal is intuitive and effective, though reliant on CLIP's own calibration.