CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
RL: Reinforcement Learning—a machine learning paradigm where an agent learns to make decisions by performing actions and receiving rewards
GRPO: Group Relative Policy Optimization—a policy gradient algorithm that estimates advantages by normalizing rewards within a group of sampled outputs for the same input, removing the need for a separate value network
PPO: Proximal Policy Optimization—a standard RL algorithm that updates policies using a clipped objective to prevent large, unstable updates
hallucination: The generation of factually incorrect, nonsensical, or unfaithful content by a language model
policy gradient: An optimization technique in RL that updates the policy parameters in the direction of the gradient of expected reward
advantage: A value measuring how much better a specific action is than the expected outcome of the policy's average behavior in a given state
entropy: A measure of randomness or uncertainty in the model's predictions; high entropy means probability is spread across many possible outputs, so the model is exploring many possibilities
spurious local optima: Suboptimal solutions in which the model converges to a behavior (such as confidently outputting a wrong answer) that yields zero reward yet provides no gradient signal for correcting it
REINFORCE: A fundamental policy gradient algorithm that uses Monte Carlo sampling to estimate gradients
entailment: A logical relationship where the truth of one statement (evidence) guarantees the truth of another (generated text)
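
To make the GRPO, advantage, and spurious local optima entries concrete, here is a minimal sketch of group-relative advantage estimation. It is an illustration under stated assumptions, not the formulation of any particular paper: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all hypothetical choices. The glossary only says that rewards are normalized within a group of outputs sampled for the same input.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes: correct outputs get positive advantages,
# incorrect ones get negative advantages, and no value network is needed.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sampled output is wrong: all rewards equal the mean,
# so every advantage is zero and the policy gradient carries no signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

The second call shows why a spurious local optimum can be hard to escape under this kind of estimator: when every output in the group earns the same reward, the normalized advantages are all zero, and nothing pushes the policy away from its current behavior.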