
Think Only When You Need with Large Hybrid-Reasoning Models

Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
Microsoft Research, Peking University
arXiv.org (2025)
Reasoning RL

📝 Paper Summary

Tags: Large Reasoning Models (LRMs) · Inference efficiency · Adaptive computation
Large Hybrid-Reasoning Models (LHRMs) adaptively switch between extended reasoning and direct answering based on query complexity, improving efficiency and reducing overthinking without sacrificing performance.
Core Problem
Current Large Reasoning Models (LRMs) suffer from 'overthinking': they expend excessive compute and incur needless latency on simple queries where extended reasoning traces are unnecessary.
Why it matters:
  • Wasteful token consumption and high latency on trivial inputs (e.g., 'Hello') make deployment inefficient
  • Existing methods focus on converting LLMs into LRMs but ignore the practical overhead of always-on reasoning traces
  • Uniformly applying heavy reasoning to all queries misaligns with human cognitive patterns, where simple tasks are handled intuitively
Concrete Example: When a user inputs a trivial greeting like 'Hello', a standard LRM might generate a lengthy, unnecessary internal reasoning trace before responding, whereas the proposed LHRM detects the simplicity and bypasses the thinking process.
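The dispatch behavior described above can be sketched as a simple router. Note this is purely illustrative: in the actual LHRM the mode-selection policy is learned end-to-end (via HFT and HGPO), not hand-written; the `needs_thinking` rule and the `<think>` tag format below are assumptions for the sketch.

```python
# Illustrative sketch of adaptive mode dispatch. The real LHRM learns this
# policy from data; here a hypothetical rule stands in for the learned policy.

def needs_thinking(query: str) -> bool:
    """Stand-in for the learned mode-selection policy: trivial
    greetings get the No-Thinking path, everything else gets Thinking."""
    trivial = {"hello", "hi", "thanks", "good morning"}
    return query.strip().lower().rstrip("!.") not in trivial

def respond(query: str) -> str:
    if needs_thinking(query):
        # Thinking mode: emit an extended reasoning trace before answering.
        return f"<think>step-by-step reasoning about: {query}</think> final answer"
    # No-Thinking mode: answer directly, skipping the trace entirely.
    return f"direct answer to: {query}"

print(respond("Hello"))                              # No-Thinking path
print(respond("Prove that sqrt(2) is irrational."))  # Thinking path
```

The key point is that the expensive trace is generated only when the (learned) policy requests it, so trivial inputs pay no reasoning-token cost.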
Key Novelty
Adaptive Reasoning Mode Selection via Hybrid Group Policy Optimization (HGPO)
  • Introduces a model that supports two distinct modes—Thinking (reasoning-intensive) and No-Thinking (direct answer)—and learns to select the optimal one per query
  • Uses a two-stage pipeline: Hybrid Fine-Tuning (HFT) to initialize both modes, followed by HGPO (a novel RL method) to learn the switching policy based on query context
  • Defines a new metric, Hybrid Accuracy, to evaluate how well the model's chosen mode aligns with the optimal mode for a given task
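The Hybrid Accuracy metric mentioned in the last bullet can be read as an agreement rate between the model's chosen mode and the mode judged optimal per query. A minimal sketch under that reading (the paper's exact formulation, including how the optimal mode is determined, may differ):

```python
# Sketch of Hybrid Accuracy as a per-query mode-agreement rate.
# Mode labels 'think' / 'no_think' are illustrative, not the paper's notation.

def hybrid_accuracy(chosen_modes: list[str], optimal_modes: list[str]) -> float:
    """Fraction of queries whose chosen mode matches the optimal mode."""
    assert len(chosen_modes) == len(optimal_modes) and chosen_modes
    matches = sum(c == o for c, o in zip(chosen_modes, optimal_modes))
    return matches / len(chosen_modes)

chosen  = ["think", "no_think", "think", "no_think"]
optimal = ["think", "no_think", "no_think", "no_think"]
print(hybrid_accuracy(chosen, optimal))  # 0.75
```

Framing the metric as simple agreement is what makes it comparable against human expert judgments of which mode each query warrants.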
Evaluation Highlights
  • Outperforms existing LRMs and LLMs in reasoning and general capabilities while improving efficiency
  • Hybrid Accuracy metric correlates strongly with human expert judgment on mode selection
  • Effectively handles queries of varying difficulty and type across math, programming, and general domains, at Qwen-2.5 scales from 1.5B to 7B
Breakthrough Assessment
8/10
Significant step toward efficient deployment of reasoning models. Addressing the 'overthinking' problem via adaptive RL is a practical and methodologically sound advancement.