LVLM: Large Vision-Language Model—a multimodal model that processes images together with text instructions.
DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without training a separate reward model, by minimizing a classification loss on preference pairs.
In-context self-critic: A method where the model evaluates its own outputs by being prompted with specific criteria and examples within the input context, rather than using a trained reward model.
CHAIR: Caption Hallucination Assessment with Image Relevance—a metric measuring the proportion of objects mentioned in a caption that do not appear in the image.
Greedy decoding: A generation strategy where the model selects the highest-probability token at every step.
Temperature sampling: A generation strategy that samples each token from the distribution obtained by dividing the logits by a temperature; higher temperatures produce more diverse (and potentially erroneous) outputs.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to them.
RLHF: Reinforcement Learning from Human Feedback—a technique that aligns models by optimizing them with reinforcement learning against a reward model trained on human preferences.
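
Several of the terms above reduce to short formulas, so a minimal Python sketch may help make them concrete. It covers one step of greedy decoding, one step of temperature sampling, the per-caption CHAIR ratio as defined above, and the DPO loss for a single preference pair. All function names, the toy logits, and the default `beta=0.1` are illustrative assumptions, not taken from any particular codebase; real implementations operate on batched tensors and use an object vocabulary with synonym matching for CHAIR.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding step: index of the highest-probability token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def temperature_pick(logits, temperature, rng):
    """Temperature sampling step: draw one index from the scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def chair_ratio(mentioned_objects, image_objects):
    """Share of mentioned objects that are absent from the image.

    Returns 0.0 when the caption mentions no objects (an assumption made
    here to avoid dividing by zero).
    """
    if not mentioned_objects:
        return 0.0
    hallucinated = [o for o in mentioned_objects if o not in image_objects]
    return len(hallucinated) / len(mentioned_objects)


def dpo_pair_loss(policy_chosen, policy_rejected,
                  ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy being trained and under the frozen reference model."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    toy_logits = [1.0, 3.0, 2.0]  # hypothetical scores for three tokens
    rng = random.Random(0)
    print("greedy:", greedy_pick(toy_logits))
    print("sampled:", temperature_pick(toy_logits, 1.5, rng))
    print("CHAIR:", chair_ratio(["dog", "frisbee", "car"], {"dog", "frisbee"}))
    print("DPO loss:", dpo_pair_loss(-5.0, -7.0, -6.0, -6.0))
```

In this sketch the DPO loss equals log 2 when the policy and the reference model assign identical log-probabilities, and it falls as the policy raises the chosen response relative to the rejected one. That is the sense in which DPO is a classification loss on preference pairs.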