RLCF: Reinforcement Learning with CLIP Feedback—the proposed framework using CLIP scores as rewards to update models at test time.
TTA: Test-Time Adaptation—adjusting a pre-trained model's parameters during inference on test data to handle distribution shifts.
CLIP: Contrastive Language-Image Pre-training—a model trained to align images and text in a shared embedding space, used here as a reward model.
TPT: Test-Time Prompt Tuning—a baseline method that optimizes learnable prompt tokens by minimizing output entropy.
REINFORCE: A Monte-Carlo policy gradient algorithm used to optimize parameters to maximize expected reward.
CLIPScore: A metric measuring the cosine similarity between image and text embeddings from a CLIP model, used here as the reward signal.
OOD: Out-of-Distribution—data that differs significantly from the training distribution.
CIDEr: Consensus-based Image Description Evaluation—a metric for evaluating image captioning quality.
LLM: Large Language Model—a generative text model used in the captioning pipeline.
Beam search: A heuristic search algorithm that, at each decoding step, keeps only a fixed number (the beam width) of the highest-scoring partial sequences and expands those.
Momentum buffer: A stored moving average of model parameters, updated as test samples arrive, that enables incremental learning across test samples.
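
To make the CLIPScore, REINFORCE, and momentum buffer entries concrete, the following is a minimal toy sketch rather than the RLCF implementation. It assumes a categorical policy over four candidate texts, hand-written 3-dimensional embeddings in place of real CLIP features, plain cosine similarity as the reward, no reward baseline, and a moving average kept over the policy logits. All function names, the learning rate, and the momentum value are illustrative assumptions.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, rewards, n_samples, rng):
    """Monte-Carlo estimate of the gradient of expected reward w.r.t. logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        for i, p in enumerate(probs):
            score = (1.0 if i == a else 0.0) - p
            grad[i] += rewards[a] * score / n_samples
    return grad


def ema_update(buffer, params, momentum=0.9):
    """Moving-average update: buffer <- m * buffer + (1 - m) * params."""
    return [momentum * b + (1.0 - momentum) * p for b, p in zip(buffer, params)]


def run_demo(steps=300, lr=0.5, n_samples=32, seed=0):
    rng = random.Random(seed)
    # Toy stand-ins for CLIP embeddings (assumption: real ones come from CLIP).
    image_emb = [1.0, 0.0, 0.0]
    text_embs = [
        [0.0, 1.0, 0.0],
        [1.0, 2.0, 0.0],
        [1.0, 0.1, 0.0],   # closest in direction to the image embedding
        [-1.0, 0.0, 0.0],
    ]
    # Reward for each candidate text: cosine similarity with the image.
    rewards = [cosine_similarity(image_emb, t) for t in text_embs]
    logits = [0.0] * len(text_embs)
    buffer = list(logits)
    for _ in range(steps):
        grad = reinforce_grad(logits, rewards, n_samples, rng)
        # Gradient ascent, since the goal is to maximize expected reward.
        logits = [z + lr * g for z, g in zip(logits, grad)]
        buffer = ema_update(buffer, logits)
    return softmax(logits), softmax(buffer)


if __name__ == "__main__":
    probs, averaged_probs = run_demo()
    print("policy after updates:", [round(p, 3) for p in probs])
    print("policy from averaged logits:", [round(p, 3) for p in averaged_probs])
```

In this sketch the policy shifts probability toward the candidate whose embedding best matches the image, which is the behavior the reward is meant to encourage. A practical setup would typically subtract a baseline from the reward to reduce variance; that step is omitted here for brevity.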