RLHF: Reinforcement Learning from Human Feedback—aligning language models to follow human intent using rewards derived from preference data
Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their latent reward scores
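The Bradley-Terry preference probability can be written as a sigmoid of the reward difference, P(A ≻ B) = σ(r_A − r_B). A minimal sketch (the function name and scalar-reward setup are illustrative, not from the source):

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B:
    P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B)) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))
```

With equal rewards the model assigns probability 0.5 to either ordering; the two directed probabilities always sum to 1.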
Best-of-N: An inference strategy where N responses are generated, scored by a reward model, and the highest-scoring response is selected
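Best-of-N reduces to sample-then-argmax under a reward model. A minimal sketch, assuming a sampler and a scalar scoring function (both hypothetical callables standing in for the generator and reward model):

```python
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Sample n candidate responses and return the one the
    reward model scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

For example, with `score=len` as a toy reward model, the longest of the n sampled responses is returned.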
Chain-of-Thought: A prompting technique where models generate intermediate reasoning steps before producing a final answer
Self-consistency: An inference technique that samples multiple reasoning paths and aggregates the results (e.g., via voting or averaging) to improve reliability
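For discrete answers, the aggregation step in self-consistency is typically a majority vote over the final answers extracted from each sampled reasoning path. A minimal voting sketch (assumes answers have already been extracted as strings):

```python
from collections import Counter
from typing import List

def self_consistency_vote(final_answers: List[str]) -> str:
    """Majority vote over final answers from independently
    sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]
```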
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs
Off-policy training: Training on data generated by a different policy (e.g., oracle critiques) rather than the model's own current predictions
On-policy training: Training the model on its own generated outputs (self-generated critiques) to reduce distribution shift during inference
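The off-policy vs. on-policy distinction comes down to where the training targets originate. A schematic contrast (function and variable names are illustrative; `model` stands in for the policy being trained):

```python
from typing import Callable, List, Tuple

def off_policy_batch(dataset: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Off-policy: targets come from a fixed corpus of externally
    generated (e.g., oracle) critiques, not from the model itself."""
    return [(prompt, oracle_critique) for prompt, oracle_critique in dataset]

def on_policy_batch(model: Callable[[str], str],
                    prompts: List[str]) -> List[Tuple[str, str]]:
    """On-policy: targets are the model's own sampled outputs, so the
    training distribution matches what the model produces at inference."""
    return [(p, model(p)) for p in prompts]
```

In the on-policy case the batch changes as the model changes, which is what reduces the train/inference distribution shift mentioned above.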