VAS: Value Augmented Sampling—the proposed method that uses a value function to adjust token probabilities at inference time
PPO: Proximal Policy Optimization—a standard RL algorithm that updates model weights to maximize reward while limiting deviation from the old policy
DPO: Direct Preference Optimization—a method that aligns models directly on preference pairs, without training a separate reward model
BoN: Best-of-N—a search strategy that generates N full responses and selects the one with the highest reward
MCTS: Monte-Carlo Tree Search—a search algorithm that simulates future states to estimate the value of candidate actions before choosing one
Value function: A function estimating the total expected future reward from a specific state (current text sequence)
Q-value: The expected future reward of taking a specific action (next token) in a specific state
SFT: Supervised Fine-Tuning—the initial training phase using labeled examples before alignment
TD learning: Temporal Difference learning—an RL method that updates a value estimate toward a bootstrapped target: the observed reward plus the discounted value estimate of the next state
KL divergence: A measure of how much one probability distribution differs from another, used here to keep the aligned model close to the original
Black-box model: A model (like GPT-4) where only inputs and outputs are accessible, not internal weights or gradients
FLOPs: Floating Point Operations—a count of arithmetic operations, used as a measure of computational cost (FLOPS, per second, instead measures hardware throughput)
FUDGE: A prior method using a classifier to guide decoding; VAS differs by using a value function trained via RL
Alignment tax: The loss of general capabilities (e.g., reasoning) when a model is fine-tuned for a specific narrow objective
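To make the VAS, value-function, and Q-value entries concrete, here is a toy sketch of value-augmented decoding: the base model's token logits are shifted by beta-weighted Q-value estimates before sampling. The vocabulary, numbers, and the `beta` weighting shown are illustrative assumptions, not the paper's exact formulation.

```python
import math

def vas_logits(base_logits, q_values, beta):
    """Shift each base-model logit by beta * Q(state, token)."""
    return [l + beta * q for l, q in zip(base_logits, q_values)]

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy 3-token vocabulary: the base model prefers token 0, but the
# (hypothetical) value function estimates higher future reward for token 2.
base = [2.0, 0.5, 1.0]
q = [0.0, 0.1, 2.0]
probs = softmax(vas_logits(base, q, beta=1.0))
# Augmented logits are [2.0, 0.6, 3.0], so token 2 is now most probable.
```

Setting `beta=0` recovers the base model's distribution unchanged, which is why the strength of the value guidance can be tuned at inference time without retraining.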
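The BoN entry above can be sketched in a few lines. The `generate` and `reward` callables here are stand-ins for a language model and a reward model, not a real API.

```python
import random

def best_of_n(generate, reward, prompt, n):
    """Sample n complete responses and return the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: "generation" draws a random integer, and the "reward
# model" prefers values close to 7.
rng = random.Random(0)
gen = lambda prompt: rng.randint(0, 10)
rew = lambda x: -abs(x - 7)
best = best_of_n(gen, rew, "some prompt", n=16)
```

Note the cost profile this implies: BoN needs no training, but inference cost grows linearly with N because every candidate is generated in full before selection.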
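The TD-learning entry can be illustrated with a minimal TD(0) update; the state names, reward, and hyperparameters are made up for the example.

```python
def td_update(values, state, next_state, reward, gamma=0.9, lr=0.5):
    """Move values[state] toward the bootstrapped target
    reward + gamma * values[next_state]; return the TD error."""
    target = reward + gamma * values[next_state]
    td_error = target - values[state]
    values[state] += lr * td_error
    return td_error

# One update: the target is 1.0 + 0.9 * 1.0 = 1.9, and values["s0"]
# moves halfway (lr=0.5) from 0.0 toward it, landing at 0.95.
values = {"s0": 0.0, "s1": 1.0}
err = td_update(values, "s0", "s1", reward=1.0)
```

The key point for VAS is that this kind of update trains the value function from sampled text without ever backpropagating through the base model's weights.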
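Finally, the KL-divergence entry for discrete distributions, computed on two toy 3-token distributions standing in for the base and aligned models:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.7, 0.2, 0.1]     # toy base-model token distribution
aligned = [0.6, 0.3, 0.1]  # toy aligned-model token distribution
d = kl_divergence(aligned, base)
```

The divergence is zero only when the two distributions match and grows as they drift apart, which is why penalizing it keeps the aligned model close to the original. Note that it is asymmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.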