
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

A Liu, B Zhou, C Xu, C Zhou, CC Zhang, C Xu, C Wang…
Tencent
arXiv, May 2025
Pretraining Reasoning RL Benchmark

📝 Paper Summary

Hybrid LLM Architectures Efficient Long-Context Processing Post-training Optimization
Hunyuan-TurboS combines a hybrid Transformer-Mamba MoE architecture for efficient inference with an adaptive chain-of-thought mechanism that dynamically switches between fast responses and deep reasoning based on problem complexity.
Core Problem
Standard Transformer LLMs suffer from high inference cost and KV-cache memory growth on long sequences, while reasoning models often spend heavy compute even on simple queries.
Why it matters:
  • Pure Transformers have quadratic complexity, making long-context inference slow and memory-intensive.
  • Existing reasoning models (like o1) apply heavy compute indiscriminately, wasting resources on simple tasks.
  • Deploying large-scale reasoning models at industry scale requires balancing high performance with strictly constrained inference costs.
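The memory side of this gap can be made concrete with rough arithmetic (the configuration numbers below are illustrative, not taken from the paper): an attention layer's KV cache grows linearly with sequence length, while a Mamba2 layer keeps a fixed-size state regardless of context length.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # Attention: K and V are cached per token, per layer (fp16 = 2 bytes/value).
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_val

def mamba_state_bytes(n_layers, d_inner, d_state, bytes_per_val=2):
    # Mamba2: one fixed-size SSM state per layer, independent of seq_len.
    return n_layers * d_inner * d_state * bytes_per_val

# Illustrative config: 64 layers, 8 KV heads of dim 128, 256k-token context.
full_attn = kv_cache_bytes(256_000, 64, 8, 128)
# Hybrid: suppose only 1 in 4 layers is attention, the rest Mamba2.
hybrid = kv_cache_bytes(256_000, 16, 8, 128) + mamba_state_bytes(48, 8192, 128)
print(f"full-attention KV cache: {full_attn / 1e9:.1f} GB")
print(f"hybrid KV + SSM state:   {hybrid / 1e9:.1f} GB")
```

Under these assumed numbers the hybrid stack caches roughly a quarter of the full-attention memory, since the Mamba2 state is constant-size and negligible at long context.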
Concrete Example: A pure Transformer reasoning model might use a long chain-of-thought to answer 'What is 2+2?', wasting tokens. Hunyuan-TurboS detects this simplicity and uses a short path, while activating deep reasoning only for complex math problems.
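The routing idea behind this example can be sketched as a toy gate (the heuristic score and 0.5 threshold are invented for illustration; the actual model learns to choose its mode end-to-end rather than applying a hand-written rule):

```python
def pick_mode(complexity_score: float) -> str:
    """Route a query to a short reply or a long chain-of-thought.

    `complexity_score` stands in for a learned difficulty estimate in [0, 1];
    the 0.5 threshold is an arbitrary illustrative choice.
    """
    return "think" if complexity_score >= 0.5 else "fast"

def answer(query: str, complexity_score: float) -> str:
    if pick_mode(complexity_score) == "fast":
        # Short path: answer directly, spending no reasoning tokens.
        return f"<answer to {query!r}>"
    # Long path: emit an explicit chain-of-thought before the answer.
    return f"<step-by-step reasoning>\n<answer to {query!r}>"

print(answer("What is 2+2?", complexity_score=0.05))
print(answer("Prove the AM-GM inequality.", complexity_score=0.9))
```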
Key Novelty
Hybrid Mamba-Transformer MoE with Adaptive Reasoning
  • Integrates Mamba2 layers (linear complexity) with Attention layers (contextual capability) and MoE FFNs to reduce active parameters and KV cache.
  • Implements an 'Adaptive Long-short Chain-of-Thought' where the model autonomously selects 'thinking' mode for hard tasks or rapid response for simple ones.
  • Uses a multi-stage post-training pipeline including Deliberation Learning and two-stage GRPO to refine both reasoning and general instruction following.
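The GRPO stages above rest on a group-relative advantage: several responses are sampled per prompt, and each reward is normalized against the group's mean and standard deviation. A minimal sketch of that computation (the paper's full two-stage pipeline adds considerably more machinery):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: (r_i - mean) / std over one prompt's samples."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0:  # all rewards equal: no learning signal for this group
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]

# Four sampled responses to one prompt, scored by a reward model.
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
```

Responses scored above the group mean get positive advantages and are reinforced; those below it are penalized, all without a separate value network.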
Evaluation Highlights
  • Achieved a 1356 score on the LMSYS Chatbot Arena, placing in the top 7 overall and ahead of o4-mini-2025-04-16.
  • Outperforms GPT-4.5 on math benchmarks (GSM8K: 94.39% vs 91.9%; MATH: 90.0% vs 86.2%).
  • Reduces inference cost significantly: requires only 40.5% of Qwen3-235B-A22B's generation cost while maintaining competitive performance.
Breakthrough Assessment
9/10
First industry-scale deployment of a large Mamba-based model (560B params). Successfully demonstrates that hybrid architectures can rival pure Transformers in performance while significantly cutting inference costs.