RLVR: Reinforcement Learning with Verifiable Rewards—using outcome-based feedback (e.g., whether the final answer is correct) to train reasoning models
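A minimal sketch of what a verifiable reward can look like, assuming answers are marked with a GSM8K-style "#### <answer>" suffix (this format is an assumption for illustration, not necessarily the one used here): the reward is binary and depends only on the outcome, not on the reasoning trace.

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Outcome-based reward: 1.0 iff the extracted final answer matches.

    Assumes the final answer is marked with "#### <integer>" (hypothetical
    convention borrowed from GSM8K-style datasets).
    """
    match = re.search(r"####\s*(-?\d+)", model_output)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if match.group(1) == gold_answer.strip() else 0.0

print(verifiable_reward("... so the total is #### 42", "42"))  # 1.0
print(verifiable_reward("I think it's 41", "42"))              # 0.0
```

Because the verifier only checks the outcome, any chain of thought that reaches the right answer is rewarded equally.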
Grokking: A learning dynamic where performance remains flat (near chance) for a long period before suddenly jumping to high accuracy
Relay Effect: A mechanism where gradients from solving easier/shorter tasks improve the model just enough to make slightly harder tasks solvable, creating a continuous chain of progress
Edge of Competence: The difficulty regime where a model has non-trivial success rates (not pure guessing) but has not yet mastered the task; the optimal zone for RL training
REINFORCE: A basic policy gradient algorithm that updates model parameters based on the product of the reward and the gradient of the log-probability of the action
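The REINFORCE update can be sketched on a toy multi-armed bandit (a hypothetical setup chosen for brevity, not the paper's training pipeline): the parameter update is the reward times the gradient of the log-probability of the sampled action, here with a running-average baseline to reduce variance.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_grad(logits, action, reward):
    """Gradient of reward * log pi(action) w.r.t. softmax logits.

    For a softmax policy, grad log pi(a) = one_hot(a) - pi.
    """
    grad = -softmax(logits)
    grad[action] += 1.0
    return reward * grad

# Toy 3-armed bandit: arm 2 pays off most often.
rng = np.random.default_rng(0)
pay = np.array([0.2, 0.5, 0.9])
logits = np.zeros(3)
baseline = 0.0  # running average reward (variance reduction)
for _ in range(3000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    r = float(rng.random() < pay[a])  # Bernoulli reward
    logits += 0.1 * reinforce_grad(logits, a, r - baseline)
    baseline += 0.01 * (r - baseline)

print(softmax(logits))  # probability mass should concentrate on arm 2
```

Note that when rewards are sparse and binary, as in RLVR, this gradient is nonzero only on trajectories that happen to succeed, which is why a non-trivial success rate matters.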
SFT: Supervised Fine-Tuning—training on labeled examples with direct token-level supervision (ground-truth next-token targets)
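The SFT objective reduces to cross-entropy against the ground-truth next token at every position. A minimal numeric sketch (toy logits over a hypothetical 5-token vocabulary):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean negative log-likelihood of the target token at each position.

    logits: (seq_len, vocab) array; targets: (seq_len,) token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.zeros((3, 5))        # uniform predictions over 5 tokens
targets = np.array([1, 4, 2])    # ground-truth next tokens
print(next_token_loss(logits, targets))  # log(5) ~= 1.609 for a uniform model
```

Unlike the RL reward, this loss supplies a dense gradient at every token, regardless of whether the final answer would have been correct.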
Fourier Analysis on Groups: A mathematical technique decomposing functions on a group into irreducible representations, used here to analyze the convolution of probability measures on the group
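The key identity being used is the standard convolution theorem on a finite group (stated here in its textbook form, not a paper-specific result): convolution of measures becomes multiplication of their Fourier transforms at each irreducible representation.

```latex
% Convolution theorem on a finite group G: for probability measures
% \mu, \nu on G and an irreducible representation \rho,
\[
  \widehat{\mu * \nu}(\rho) = \hat{\mu}(\rho)\,\hat{\nu}(\rho),
  \qquad
  \hat{\mu}(\rho) = \sum_{g \in G} \mu(g)\,\rho(g).
\]
% Hence k i.i.d. composition steps have Fourier transform \hat{\mu}(\rho)^k,
% whose spectral norm governs how fast the product distribution flattens.
```

This is why the distribution of a composed sequence of group elements can be analyzed one representation at a time.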
Atomic Skill: The basic single-step operation (e.g., one group multiplication) which the model's MLP is assumed to already possess, leaving the Attention layer to learn composition
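The atomic-skill/composition split can be made concrete with a toy S_3 example (an illustration of the definition, not the paper's exact setup): the atomic skill is a single group multiplication, and the composite task is reducing a whole word of group elements by chaining that step.

```python
# Permutations of {0, 1, 2} as tuples; S_3 under composition.
def multiply(p, q):
    """One group multiplication, (p * q)(i) = p(q(i)): the atomic step."""
    return tuple(p[q[i]] for i in range(len(q)))

def reduce_word(word, identity):
    """The composite task: fold a sequence of elements with the atomic step."""
    acc = identity
    for g in word:
        acc = multiply(acc, g)
    return acc

e = (0, 1, 2)  # identity
s = (1, 0, 2)  # transposition swapping 0 and 1
c = (1, 2, 0)  # 3-cycle
print(reduce_word([s, s], e))      # s * s = identity
print(reduce_word([c, c, c], e))   # a 3-cycle cubed is the identity
```

In the glossary's framing, the MLP is credited with `multiply`, so what remains for the Attention layer to learn is the routing that implements `reduce_word`.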