RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using RL where the reward is based on the objective correctness of the final answer (e.g., math problems)
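As an illustration, a verifiable reward can be as simple as an exact-match check on the final answer against a known ground truth. The helper below is a hypothetical sketch (the function name and normalization are my own, not the paper's implementation):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 iff the model's final answer matches the ground truth
    after simple normalization (illustrative sketch, not the paper's code)."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```

Real verifiers for math typically also parse and compare expressions symbolically, but the binary, answer-based structure of the reward is the same.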
CoNet: Concept Network Model—a minimal computational proxy used by the authors to simulate the coarse-grained reasoning graph of an LLM without dealing with the full high-dimensional latent space
SFT: Supervised Fine-Tuning—training a model via maximum-likelihood (next-token) learning on labeled demonstration examples
concept web: The authors' theoretical construct for the coarse-grained backbone of an LLM's reasoning graph, posited to be a sparse network with average degree ~2
V-shaped trajectory: The phenomenon where the length of correct reasoning chains first decreases (local optimization), then increases (global integration) during training
catastrophic forgetting: The abrupt degradation of previously learned capabilities when a model is trained on new data
policy collapse: A reduction in the diversity of solutions generated by the model, where it converges to a narrow set of rigid trajectories
GRPO: Group Relative Policy Optimization—an RL algorithm used as the baseline and cooling stage in this paper
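GRPO's defining step is computing advantages relative to a group of completions sampled for the same prompt, rather than from a learned value baseline. A minimal sketch of that group normalization (the function name and the zero-variance guard are my own assumptions):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mu) / sigma for r in rewards]
```

Each advantage then weights the policy-gradient update for its completion; completions better than the group average are reinforced, worse ones suppressed.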
annealing: In this context, a training strategy that temporarily increases 'temperature' (via SFT) to break local optima before 'cooling' (resuming RL) to settle into a better state
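The heat-then-cool alternation can be sketched as a simple phase schedule; the generator interface and phase lengths below are hypothetical, not taken from the paper:

```python
def annealed_schedule(n_cycles: int, rl_steps: int, sft_steps: int):
    """Yield (phase, cycle, step) tuples alternating RL 'cooling' runs
    with brief SFT 'heating' bursts that perturb the policy out of
    local optima (illustrative schedule only)."""
    for cycle in range(n_cycles):
        for s in range(rl_steps):
            yield ("rl", cycle, s)   # settle via RL (e.g., GRPO)
        for s in range(sft_steps):
            yield ("sft", cycle, s)  # heat: brief SFT perturbation
```

In this sketch each cycle ends with the SFT burst, so the final cooling phase would be appended separately; the point is only the alternating structure.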