
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma
City University of Hong Kong, McGill University & MILA, Renmin University of China, Stanford University
arXiv (2025)
Reasoning · RL · Agent · Benchmark

📝 Paper Summary

Inference-time compute · Reasoning strategies · System 2 thinking in LLMs
The paper proposes a hierarchical taxonomy for Test-Time Scaling—classifying methods by what, how, where, and how well they scale—to systematize techniques that trade inference compute for enhanced reasoning.
Core Problem
Pre-training scaling is hitting diminishing returns and resource limits, while test-time strategies (exemplified by models like o1) are exploding in popularity but lack a unified framework for comparison and systematic understanding.
Why it matters:
  • Training-time scaling (more parameters/data) is becoming prohibitively expensive and data-constrained
  • Inference strategies enable 'System 2' deliberate reasoning, unlocking capabilities in math and coding that single-pass generation cannot achieve
  • Current research is fragmented; a unified taxonomy is needed to identify open challenges like generalizing to non-reasoning tasks and clarifying the essence of scaling techniques
Concrete Example: When an LLM attempts a complex math proof, a standard single-pass inference often fails. A 'Parallel Scaling' approach might sample 100 solutions and vote, while a 'Sequential Scaling' approach might iteratively refine one solution. Without this survey's framework, it is unclear how to compare the compute-accuracy trade-offs of these distinct strategies.
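The two strategies in this example can be contrasted in a minimal sketch. Here `generate` and `refine` are hypothetical stand-ins for LLM calls (stubbed with a noisy oracle), not an API from the surveyed systems; the point is only the structural difference between spending compute on breadth (many independent samples plus a majority vote) versus depth (one draft revised repeatedly):

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Stand-in for one LLM sample: a noisy stub that answers '42'
    60% of the time. Swap in a real model call in practice."""
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def parallel_scaling(prompt: str, n: int = 100) -> str:
    """Parallel scaling (self-consistency): draw n independent samples
    and return the most common final answer."""
    samples = [generate(prompt) for _ in range(n)]
    answer, _count = Counter(samples).most_common(1)[0]
    return answer

def refine(prompt: str, draft: str) -> str:
    """Stand-in for a critique-and-revise LLM call: nudges the draft
    toward the correct answer with some probability."""
    return "42" if random.random() < 0.5 else draft

def sequential_scaling(prompt: str, steps: int = 5) -> str:
    """Sequential scaling: start from a single draft and iteratively
    refine it, spending compute on depth rather than breadth."""
    draft = generate(prompt)
    for _ in range(steps):
        draft = refine(prompt, draft)
    return draft

random.seed(0)
print(parallel_scaling("What is 6 * 7?"))
print(sequential_scaling("What is 6 * 7?"))
```

Both functions consume extra inference compute for the same prompt; the survey's taxonomy is what lets their compute-accuracy trade-offs be compared on equal footing.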
Key Novelty
The 'What, How, Where, How Well' Taxonomy
  • Decomposes Test-Time Scaling (TTS) into four orthogonal axes: 'What' (Parallel, Sequential, Hybrid, Internal), 'How' (Tuning vs. Inference), 'Where' (Tasks), and 'How Well' (Evaluation)
  • Distinguishes between 'External' scaling (explicit search/sampling) and 'Internal' scaling (models like o1 that learn to allocate latent compute autonomously)
  • Unifies diverse techniques like Best-of-N, Tree of Thoughts, and Process Reward Models under a single structural framework to reveal development trajectories
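The unification the last bullet describes can be illustrated with a toy sketch: under the taxonomy, a Process Reward Model simply replaces majority voting as the selection rule in a Best-of-N pipeline. The `prm_score_step` heuristic below is a hypothetical stand-in (a real PRM is a learned scorer), and min-aggregation over steps is one common choice, assumed here for concreteness:

```python
# Toy sketch: rerank sampled reasoning chains with a process-reward-style
# scorer. Each chain is scored step by step; the chain whose weakest step
# scores highest is selected (min-aggregation heuristic).

def prm_score_step(step: str) -> float:
    """Hypothetical stand-in for a learned PRM: here, longer (more
    worked-out) steps score higher, capped at 1.0."""
    return min(1.0, len(step) / 20.0)

def best_of_n_with_prm(chains: list[list[str]]) -> list[str]:
    """Select the chain maximizing the minimum per-step score."""
    return max(chains, key=lambda chain: min(prm_score_step(s) for s in chain))

chains = [
    ["6 * 7", "guess 41"],
    ["6 * 7 = 6 * 7", "6 * 7 = 42, so the answer is 42"],
]
print(best_of_n_with_prm(chains))  # selects the second, better-justified chain
```

Swapping the selection rule (vote, outcome reward, process reward) or the search structure (flat sampling, tree search) yields the different named techniques, which is the structural insight the taxonomy makes explicit.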
Breakthrough Assessment
8/10
This is a timely and comprehensive survey that organizes a chaotic, high-impact field (inference scaling/o1-style reasoning). While it proposes no new model, its taxonomy is likely to become the standard reference for future work.