RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using a reward model trained on human preferences.
VLM: Vision-Language Model—a model that processes both images and text to generate text or embeddings.
Reward Hacking: A failure mode in which a generative model raises its reward score without improving the underlying quality or its alignment with human preference, often by exploiting flaws or blind spots in the reward model.
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer.
Bradley-Terry Model: A statistical model for estimating the probability that one item is preferred over another in a pairwise comparison.
BoN: Best-of-N—a sampling strategy where N candidates are generated, and the one with the highest reward score is selected.
ReFL: Reward Feedback Learning—an algorithm that optimizes diffusion models using gradients from a frozen reward model without computing log-likelihoods.
ODE sampling: Ordinary Differential Equation sampling—a deterministic method for generating samples from diffusion models by solving the probability flow ODE.
Search over Paths: An inference-time scaling technique that prunes generation trajectories during sampling based on intermediate reward feedback.
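
Two of the entries above, the Bradley-Terry Model and BoN, are easiest to read next to a small worked example. The Python sketch below is illustrative only. It assumes each item has a scalar reward r, so that the Bradley-Terry preference probability becomes sigmoid(r_a - r_b). The names bradley_terry_prob, best_of_n, toy_generate and toy_reward are hypothetical, and the toy generator and reward stand in for a real model and a learned reward model.

```python
import math
import random
from typing import Callable, Tuple


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Probability that item a is preferred over item b.

    With strengths exp(r_a) and exp(r_b), Bradley-Terry gives
    exp(r_a) / (exp(r_a) + exp(r_b)) = sigmoid(r_a - r_b).
    """
    diff = reward_a - reward_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def best_of_n(
    generate: Callable[[random.Random], str],
    reward: Callable[[str], float],
    n: int,
    rng: random.Random,
) -> Tuple[str, float]:
    """Draw n candidates and return the (candidate, score) pair with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = []
    for _ in range(n):
        candidate = generate(rng)
        scored.append((candidate, reward(candidate)))
    # max() keeps the first candidate among ties.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stand-ins: a real setup would sample from a generative model
# and score each sample with a trained reward model.
def toy_generate(rng: random.Random) -> str:
    return "".join(rng.choice("ab") for _ in range(8))


def toy_reward(text: str) -> float:
    return float(text.count("a"))


if __name__ == "__main__":
    rng = random.Random(0)
    best, score = best_of_n(toy_generate, toy_reward, n=4, rng=rng)
    print(best, score)
    print(bradley_terry_prob(score, 0.0))
```

Because BoN always picks the top-scoring candidate, a larger N also makes it more likely that the selected sample exploits weaknesses in the reward model, which is how BoN connects to the Reward Hacking entry.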