RLHF: Reinforcement Learning from Human Feedback—a method to align AI models with human values by training a reward model on human preferences and optimizing the policy against it
Reward Hacking: A failure mode where the AI optimizes for the reward score (the metric) rather than the intended high-quality behavior, often exploiting flaws in the reward model
SFT: Supervised Fine-Tuning—the initial training phase where the model learns to mimic high-quality demonstration data before RL is applied
Fact-RLHF: Factually Augmented RLHF—the paper's proposed method where the reward model is given extra factual context (captions, answers) to better judge the policy's truthfulness
PPO: Proximal Policy Optimization—an RL algorithm used to update the model's policy while ensuring stability by limiting how much the policy changes in one step
Hallucination: In LMMs, generating text that is not grounded in, or that contradicts, the visual information in the input image
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects small trainable low-rank decomposition matrices into the model's layers
LMM: Large Multimodal Model—a deep learning model capable of processing and generating output for multiple modalities, typically image and text
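KL penalty: Kullback-Leibler penalty, a regularization term added to the RL objective that penalizes divergence between the current policy and a reference policy (typically the SFT model), preventing the model from drifting too far from its initial learned behavior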
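
To make the PPO and KL penalty entries concrete, here is a minimal sketch of the two quantities in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `beta` and `eps` defaults, and the use of summed per-token log-probability differences as a single-sample KL estimate are all choices made for this example.

```python
import math


def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty (hypothetical helper).

    logp_policy / logp_ref: per-token log-probabilities of the sampled
    response under the current policy and the reference (e.g. SFT) model.
    beta: penalty strength; 0.1 is an illustrative value only.
    """
    # The summed log-ratio over sampled tokens is a single-sample
    # estimate of KL(policy || reference) for this response.
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one action (hypothetical helper).

    The probability ratio is clipped to [1 - eps, 1 + eps], which limits
    how much a single update can move the policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # The policy assigns higher log-probability than the reference,
    # so the KL estimate is positive and the reward is reduced.
    print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5]))
    # A ratio of 1.5 with positive advantage is clipped at 1 + eps.
    print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
```

In this sketch a larger `beta` keeps the policy closer to the reference model, which trades some reward for a lower risk of reward hacking.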