LRM: Large Reasoning Model—a model, such as DeepSeek-R1, that generates an extended thinking trace before producing its final answer
HGPO: Hybrid Group Policy Optimization—the proposed reinforcement learning algorithm that trains the model to select the best reasoning mode and generate high-quality answers
HFT: Hybrid Fine-Tuning—the cold-start supervised training stage using a mix of reasoning-heavy and direct-answer data
Overthinking: The phenomenon where reasoning models generate unnecessary thought traces for simple queries, wasting compute
Hybrid Accuracy: A metric measuring the proportion of prompts where the model's selected reasoning mode matches the ground-truth optimal mode
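As a minimal sketch of the Hybrid Accuracy metric defined above (the function name and list-based interface are illustrative assumptions, not the paper's implementation):

```python
def hybrid_accuracy(selected_modes, optimal_modes):
    """Fraction of prompts where the model's chosen reasoning mode
    matches the ground-truth optimal mode for that prompt."""
    assert len(selected_modes) == len(optimal_modes)
    matches = sum(s == o for s, o in zip(selected_modes, optimal_modes))
    return matches / len(selected_modes)

# The model picks the right mode on 2 of 3 prompts -> 2/3.
print(hybrid_accuracy(
    ["thinking", "no-thinking", "thinking"],
    ["thinking", "no-thinking", "no-thinking"],
))
```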
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input
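The group-relative advantage estimate at the heart of GRPO can be sketched as follows—rewards for a group of outputs sampled from the same prompt are normalized against the group's own mean and standard deviation, so no separate value network is needed (a simplified illustration; GRPO variants differ in details such as the normalization term):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Estimate a per-output advantage by standardizing each reward
    against the mean and std of the group sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero std when all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Two correct (1.0) and two incorrect (0.0) outputs in a group of four.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

Note that the advantages within a group sum to zero: outputs better than the group average are reinforced, and worse ones are penalized.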
KL divergence: Kullback–Leibler divergence—a measure of how one probability distribution diverges from another, used here as a penalty to keep the trained policy from drifting too far from the reference model
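For reference, the discrete form of the divergence can be computed as below (a self-contained sketch over explicit probability lists; in RL training the penalty is typically estimated per token from policy log-probabilities rather than over full distributions):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability lists.
    Terms with p_i = 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give zero divergence; any gap gives a positive value.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # -> 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)  # -> True
```

Because KL(P || Q) grows as the policy P moves away from the reference Q, subtracting it (scaled by a coefficient) from the reward discourages large policy drift.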
Thinking Mode: A generation mode where the model produces internal reasoning traces (e.g., within <think> tags) before the final answer
No-Thinking Mode: A generation mode where the model produces the final answer directly without internal reasoning traces
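The two modes above can be distinguished mechanically from a model's raw output. A minimal sketch, assuming reasoning traces are delimited by literal `<think>` tags as in the Thinking Mode entry (the function name and tag handling are illustrative assumptions):

```python
import re

def detect_mode(output: str) -> str:
    """Classify a generation as 'thinking' or 'no-thinking' based on
    whether it contains a non-empty <think>...</think> block."""
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return "thinking" if m and m.group(1).strip() else "no-thinking"

print(detect_mode("<think>2+2=4</think> The answer is 4."))  # -> thinking
print(detect_mode("The answer is 4."))                        # -> no-thinking
```

An empty `<think></think>` block is treated here as no-thinking, since no actual reasoning trace was produced.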