Inference Scaling: Allocating more computational resources at test time (e.g., generating many candidate solutions, running tree search) to improve performance without retraining the model
Learning-to-Reason: Enhancing reasoning capabilities through dedicated training (SFT or RL) so the model internalizes the thinking process, reducing reliance on costly inference compute
Agentic Systems: AI systems that exhibit interactivity and autonomy, using tools, interacting with an environment, or communicating with other agents to refine their reasoning
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
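The CoT entry above can be illustrated with a minimal zero-shot prompt; the trigger phrase and the sample question are illustrative choices, not a fixed standard.

```python
# Illustrative zero-shot CoT prompt. The "Let's think step by step." trigger
# is one common phrasing; the question is a made-up example.
question = "If a train travels 60 km in 40 minutes, what is its speed in km/h?"
prompt = f"Q: {question}\nA: Let's think step by step."

# The model is then expected to emit intermediate steps
# (e.g., "40 minutes is 2/3 of an hour; 60 / (2/3) = 90 km/h")
# before stating the final answer.
```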
PPO: Proximal Policy Optimization—a reinforcement learning algorithm commonly used to align LLMs with human preferences or reasoning objectives
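PPO's core idea, the clipped surrogate objective, can be sketched for a single action; this assumes scalar log-probabilities and advantage, with `epsilon` as the usual clip-range hyperparameter.

```python
import math

# Minimal sketch of PPO's clipped surrogate objective for one action.
def ppo_clip_objective(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
    # Pessimistic minimum: a large policy shift earns no extra credit,
    # which keeps each update close to the old policy.
    return min(ratio * advantage, clipped_ratio * advantage)

# A big probability increase on a positive-advantage action is capped
# at the clipped ratio (1 + epsilon) times the advantage.
obj = ppo_clip_objective(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
```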
GRPO: Group Relative Policy Optimization—a recent RL algorithm (used in DeepSeek-R1) that optimizes reasoning without a separate critic model by comparing a group of outputs
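The "comparing a group of outputs" part of GRPO can be sketched as follows: each sampled output's reward is normalized against its group's mean and standard deviation, which is what removes the need for a learned critic. The epsilon term and exact normalization are the assumed standard form.

```python
# Sketch of GRPO's group-relative advantage: normalize each output's
# reward within its sampled group, so no separate value model is needed.
def group_relative_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # Small epsilon guards against a zero-variance group.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled answers to one question, reward 1 if correct else 0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, incorrect ones negative.
```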
ORM: Outcome Reward Model—a verifier that evaluates the final answer of a reasoning chain
PRM: Process Reward Model—a verifier that evaluates the correctness of each intermediate step in a reasoning chain
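The ORM/PRM contrast can be made concrete with a toy sketch; the scores here are hand-set stand-ins for learned verifier outputs, and aggregating PRM step scores with `min` is one common choice, assumed for illustration.

```python
# Hypothetical verifier scores for a reasoning chain; in practice these
# come from trained reward models, not hand-set numbers.
def orm_score(final_answer_score):
    # ORM: judge only the outcome (the final answer).
    return final_answer_score

def prm_score(step_scores):
    # PRM: judge every intermediate step; min-aggregation means the chain
    # is only as trustworthy as its weakest step.
    return min(step_scores)

# A chain with a plausible final answer but a shaky middle step:
orm = orm_score(0.9)              # looks fine to an ORM
prm = prm_score([0.9, 0.2, 0.9])  # the PRM flags the weak step
```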
SFT: Supervised Fine-Tuning—training a model on labeled examples (e.g., question + correct reasoning trace)
AGI: Artificial General Intelligence—AI systems with broad, human-like cognitive abilities
DPO: Direct Preference Optimization—a method to align models to preferences without an explicit reward model
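The DPO entry can be grounded with its per-pair loss: a negative log-sigmoid of the scaled margin between the policy's and the reference model's log-probability gaps on the chosen vs. rejected response. The inputs here are assumed scalar log-probabilities, and `beta` is the usual temperature hyperparameter.

```python
import math

# Sketch of the DPO loss for a single preference pair
# (w = chosen/"winning" response, l = rejected/"losing" response).
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the frozen reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin; no explicit reward model needed.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does,
# so the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-2.0, logp_l=-3.0,
                ref_logp_w=-2.5, ref_logp_l=-2.5, beta=0.1)
```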