RLVR: Reinforcement Learning from Verifiable Rewards—optimizing models based on whether the final answer is correct, without human labels for intermediate steps
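A minimal sketch of a verifiable reward, assuming exact string match as the verifier (real verifiers often use math-equivalence checkers or unit tests); the function name is illustrative, not from the source:

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Reward depends only on final-answer correctness; intermediate
    reasoning steps are never labeled or scored."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

# Whitespace differences are ignored; anything else scores zero.
print(verifiable_reward("42", "42 "))  # → 1.0
print(verifiable_reward("41", "42"))  # → 0.0
```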
GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes advantages within a group of samples for the same input to reduce variance
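The group-relative normalization can be sketched as follows, assuming binary correctness rewards for several samples drawn from the same prompt (the helper name is illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward
    against the mean and std of its own group (same input prompt),
    so the policy gradient compares samples to each other rather
    than to an absolute baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt; two correct, two incorrect:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because advantages are centered within the group, a uniformly easy or uniformly hard prompt contributes no gradient signal, which is the variance-reduction effect the definition refers to.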
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer
polluter: A role played by the model during training in which it generates corrupted reasoning steps intended to mislead the agent
agent: The primary role of the model where it attempts to solve the problem or recover from corrupted context
recoverability: The ability of a model to produce a correct final answer despite starting with a partially incorrect reasoning trace
diagnosability: The ability to identify the specific step where a reasoning trace went wrong
inverse scaling: A phenomenon in which larger or more capable models perform worse on a specific metric (here, susceptibility to following errors in corrupted reasoning traces)
in-distribution guidance: A training objective that encourages the model to imitate its own successful repair trajectories, ensuring updates remain close to the model's current capabilities
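One way to realize this objective is to filter the model's own rollouts for verified-successful repairs and use only those as imitation targets. The sketch below assumes a hypothetical record format with `corrupted_prefix`, `repair`, and `correct` fields; none of these names come from the source:

```python
def in_distribution_guidance_batch(trajectories):
    """Keep only the model's own successful repair trajectories as
    imitation targets, so the training signal stays close to what
    the current policy can already produce."""
    return [
        (t["corrupted_prefix"], t["repair"])
        for t in trajectories
        if t["correct"]  # self-generated AND verified correct
    ]

rollouts = [
    {"corrupted_prefix": "2+2=5, so", "repair": "wait, 2+2=4, so", "correct": True},
    {"corrupted_prefix": "2+2=5, so", "repair": "then 5*2=10",     "correct": False},
]
# Only the verified repair survives as an imitation target.
print(in_distribution_guidance_batch(rollouts))
```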