SCoRe: Self-Correction via Reinforcement Learning—the proposed multi-turn RL method for training intrinsic self-correction.
intrinsic self-correction: The ability of a model to correct its own mistakes without any external feedback (like ground truth or human hints).
behavior collapse: A failure mode where the model learns to produce the best possible first response and then makes no substantive edits in the second turn, effectively ignoring the self-correction instruction.
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of examples.
STaR: Self-Taught Reasoner—an iterative training method where a model generates reasoning traces, filters for correct ones, and fine-tunes on them.
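The STaR loop can be illustrated with a toy sketch. Here a "model" is just a noisy adder, the filter keeps attempts whose answers verify, and "fine-tuning" is reduced to memorizing verified traces; all names and the setup are illustrative, not the paper's actual API.

```python
import random

def star_round(problems, memory, rng):
    """One STaR iteration (toy): generate -> filter -> 'fine-tune'."""
    kept = []
    for a, b in problems:
        if (a, b) in memory:                        # already learned this trace
            guess = memory[(a, b)]
        else:
            guess = a + b + rng.choice([-1, 0, 1])  # noisy first attempt
        if guess == a + b:                          # filter: keep only correct traces
            kept.append(((a, b), guess))
    memory.update(kept)                             # "fine-tune" on survivors
    return memory

rng = random.Random(0)
memory = {}
problems = [(2, 3), (4, 4), (10, 7)]
for _ in range(8):                                  # iterate the loop several times
    memory = star_round(problems, memory, rng)
print(len(memory))  # number of problems with a verified trace so far
```

The key property is that only self-generated traces that pass the correctness check ever feed back into training.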
REINFORCE: A policy gradient algorithm in reinforcement learning that updates model weights based on the reward received for generated actions.
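A minimal sketch of the REINFORCE update on a two-armed bandit, assuming a softmax policy over two logits (the bandit setup and hyperparameters are illustrative, not from the paper):

```python
import math
import random

def reinforce_bandit(rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE: logits parameterize a softmax policy over two arms."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        exps = [math.exp(x) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        action = 0 if rng.random() < probs[0] else 1
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. logit k is (1[k == action] - probs[k]);
        # scale by the reward, as in vanilla REINFORCE (no baseline).
        for k in range(2):
            logits[k] += lr * reward * ((k == action) - probs[k])
    return probs

probs = reinforce_bandit()
print(probs)  # probability mass should shift toward the higher-reward arm
```

Note that actions earning higher reward have their log-probability pushed up more strongly, which is all the update does.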
KL divergence: Kullback-Leibler divergence—a statistical measure used here to penalize the model for deviating too far from a reference policy (usually the base model).
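For discrete distributions the penalty term is straightforward to compute; a minimal sketch (example distributions are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)).

    Assumes p and q are discrete distributions over the same support,
    with q(x) > 0 wherever p(x) > 0. Zero iff p == q.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # e.g. the current policy's token distribution
q = [0.5, 0.3, 0.2]  # e.g. the reference (base model) distribution
print(kl_divergence(p, p))  # 0.0 for identical distributions
print(kl_divergence(p, q))  # positive: p has drifted from q
```

Used as a penalty, a larger value means the fine-tuned policy has drifted further from the reference model, so the training objective subtracts a scaled version of this quantity.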
on-policy: Training using data generated by the current version of the model being trained, ensuring the data distribution matches the model's behavior.
reward shaping: Modifying the reward function (e.g., adding a bonus for improvement) to guide the learning process toward desired behaviors.
edit distance: A metric measuring how dissimilar two strings are; used here to quantify how much the model changes its answer between attempts.
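The standard instance is Levenshtein distance (minimum number of single-character insertions, deletions, and substitutions); a compact dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance via a rolling DP row.

    prev[j] holds the distance between the processed prefix of `a`
    and b[:j]; each step considers deletion, insertion, substitution.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # delete from a
                           cur[j - 1] + 1,                # insert into a
                           prev[j - 1] + (ca != cb)))     # substitute (free if equal)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("same", "same"))       # 0: identical attempts
```

A near-zero edit distance between the first and second attempts is the signature of behavior collapse: the model is not actually revising anything.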