Long CoT: Long Chain-of-Thought—a reasoning format involving detailed, iterative steps, self-reflection, and verification, used by models like OpenAI-o1
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs for the same input, eliminating the need for a separate value function critic
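The group-relative normalization that replaces a learned critic can be sketched in a few lines (a minimal illustration, not the paper's full objective, which also includes the policy-ratio and KL terms):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    by the mean and std of its group (all outputs for the same input),
    standing in for a separate value-function baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one question, reward 1 if correct else 0
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Correct answers get positive advantage and incorrect ones negative, relative only to siblings from the same prompt.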
Ada-GRPO: Adaptive GRPO—the authors' proposed variant that adds a time-decaying diversity reward to prevent the model from converging to a single reasoning format
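One way such a time-decaying diversity reward could look is sketched below; the exact formula and names here (`adaptive_scale`, the exponential decay schedule) are illustrative assumptions, not the authors' specification:

```python
import math

def adaptive_scale(format_count, group_size, step, total_steps):
    """Hypothetical Ada-GRPO-style reward scaling: boost rewards for
    reasoning formats sampled rarely within the group, with the boost
    decaying over training so accuracy dominates later (assumed schedule)."""
    rarity_bonus = group_size / max(format_count, 1)  # rarer format -> larger bonus
    decay = math.exp(-step / max(total_steps, 1))     # bonus fades as training proceeds
    return 1.0 + (rarity_bonus - 1.0) * decay

# Early in training, a format sampled once in a group of 8 is strongly boosted
print(adaptive_scale(1, 8, step=0, total_steps=1000))  # 8.0
```

The point of the decay is that underused formats are kept alive early (countering format collapse) while late-stage training reverts toward the plain accuracy signal.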
Format Collapse: The phenomenon where an RL-trained model converges to using only the highest-accuracy format (usually Long CoT) for all tasks, losing the ability to use efficient formats
SFT: Supervised Fine-Tuning—training the model on labeled examples (here, questions paired with answers in specific formats) before RL
Code: A reasoning format that uses programming code (e.g., Python) to structure the problem-solving process