LRM: Large Reasoning Model—an LLM specialized in complex reasoning tasks, often generating long chains of thought
CoT: Chain-of-Thought—intermediate reasoning steps generated by a model before producing the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that samples a group of outputs for the same input and optimizes the policy using each output's reward relative to the group, removing the need for a separate value (critic) model
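The core of GRPO's group-relative comparison can be sketched as follows: rewards for a group of sampled outputs are standardized against the group's own mean and standard deviation to produce advantages. This is a minimal illustration of the advantage computation only, not a full training loop.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each sampled output's reward
    against the mean and std of its group (all outputs for the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct else 0.0:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Correct outputs receive positive advantages and incorrect ones negative, so the policy is pushed toward the better members of each group without any learned critic.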
Pareto-optimal: A state where no metric (e.g., accuracy) can be improved without degrading another (e.g., efficiency/length)
Over-thinking: The phenomenon where reasoning models generate excessively long, redundant, or looping thoughts for simple problems
ECR: Expected Correct Responses—a metric used to estimate how many correct answers a model can produce given a specific length limit
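One plausible way to estimate ECR from sampled completions, sketched below under the assumption that a response only counts if it is both correct and fits within the length budget (the exact estimator is not specified here; the function name and data shape are hypothetical):

```python
def expected_correct_responses(samples, length_limit):
    """Hypothetical ECR estimator: count sampled completions that are
    correct AND fit within the token budget `length_limit`.
    `samples` is a list of (is_correct, length_in_tokens) pairs."""
    return sum(1 for correct, length in samples
               if correct and length <= length_limit)

# Three sampled completions; only one correct answer fits a 500-token budget:
samples = [(True, 120), (True, 900), (False, 80)]
print(expected_correct_responses(samples, 500))  # → 1
```

Sweeping `length_limit` over a range then traces out how many correct answers survive at each budget, which is what makes the metric useful for studying the accuracy/length trade-off.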
KL-constrained: Kullback-Leibler divergence constrained—keeping the trained model's probability distribution close to a reference model to prevent training collapse
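A common way to impose the KL constraint in RL fine-tuning is to subtract a per-token KL estimate from the reward. The sketch below uses the low-variance "k3" estimator (exp(r) − r − 1 on the log-ratio), which is one standard choice; the penalty coefficient `beta` is an illustrative value, not one taken from this document.

```python
import math

def kl_penalty(logp, ref_logp):
    """Per-token KL estimate ('k3' estimator) between the trained policy
    and a frozen reference policy; always >= 0."""
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0

def shaped_reward(reward, logps, ref_logps, beta=0.04):
    """Subtract a summed KL penalty from the task reward, pulling the
    trained policy back toward the reference to prevent collapse.
    `beta` is an assumed coefficient for illustration."""
    kl = sum(kl_penalty(lp, rlp) for lp, rlp in zip(logps, ref_logps))
    return reward - beta * kl

# If the policy matches the reference exactly, the penalty vanishes:
print(shaped_reward(1.0, [-1.2, -0.7], [-1.2, -0.7]))  # → 1.0
```

The penalty is zero when the two distributions agree on a token and grows as they diverge, so training can improve the reward only while staying near the reference model.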