Agentic Large Language Models, a survey

📝 Paper Summary

Agentic AI, Multi-Agent Systems, Synthetic Data Generation, Reasoning LLMs

This survey organizes Agentic LLMs into a reasoning-acting-interacting taxonomy and posits that agent interactions generate new empirical data to overcome the training data plateau.

Core Problem

Standard LLMs are hitting performance plateaus due to the exhaustion of high-quality static training data, while also suffering from hallucination and limited multi-step reasoning capabilities.

Why it matters:

Training data scarcity: Gains from scaling are stalling as models run out of new human-generated text to learn from (the 'data wall')
Static models lack grounding: Traditional LLMs cannot verify facts against the real world or update their knowledge after the training cutoff
Complex task failure: Without agentic loops (reasoning/acting), models fail at tasks requiring planning, tool use, or long-term coherence

Concrete Example: In math word problems (e.g., 'Annie has a pie cut into twelve pieces...'), a standard LLM often guesses an incorrect number immediately. An Agentic LLM uses Chain-of-Thought reasoning to break the problem into steps or calls a calculator tool to perform the arithmetic, deriving the correct answer through an autonomous process.
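The tool-calling half of this example can be sketched as a small loop. This is a minimal illustration, not the survey's method: `stub_model` is a hypothetical stand-in for a real LLM, the `CALL calculator:` / `ANSWER:` message format is invented for this sketch, and the arithmetic problem is made up (the pie problem above is truncated in the source, so it is not reproduced here).

```python
import ast
import operator

# Operators the hypothetical calculator tool supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression (numbers and + - * / only)."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


def stub_model(question: str, observations: list) -> str:
    """Hypothetical stand-in for an LLM: first requests the tool, then answers."""
    if not observations:
        return "CALL calculator: 12 - 3 - 4"
    return f"ANSWER: {observations[-1]}"


def run_agent(question: str, max_steps: int = 3) -> str:
    """Alternate between model replies and tool calls until an answer appears."""
    observations = []
    for _ in range(max_steps):
        reply = stub_model(question, observations)
        if reply.startswith("CALL calculator:"):
            expr = reply.split(":", 1)[1].strip()
            observations.append(str(calculator(expr)))  # feed tool result back
        elif reply.startswith("ANSWER:"):
            return reply.split(":", 1)[1].strip()
    raise RuntimeError("no answer within the step budget")


# Invented example question, used only to exercise the loop.
print(run_agent("12 slices, 3 eaten at lunch, 4 at dinner. How many remain?"))
```

The point of the design is that the arithmetic is delegated to a deterministic tool, so the model only has to decide what to compute, not compute it.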

Key Novelty

The Reasoning–Acting–Interacting Taxonomy & Data Flywheel

Proposes a three-layered categorization: Reasoning (internal cognitive optimization), Acting (external tool use and robotics), and Interacting (multi-agent social simulation)
Identifies a 'virtuous circle' in which agents interacting with the world generate novel trajectories (experience data) that can be filtered and used to train the next generation of models, which the authors argue could ease the data scarcity problem
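The collect, filter, retrain loop described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `Trajectory` type, the scalar `skill` that stands in for a model, the reward threshold, and the `train` update rule are all hypothetical, and the numbers say nothing about how real systems improve.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str
    steps: list
    reward: float  # hypothetical success score in [0, 1]


def collect_trajectories(skill, tasks, rng):
    """Roll out a toy agent; a higher skill value tends to earn higher reward."""
    trajectories = []
    for task in tasks:
        reward = min(1.0, max(0.0, rng.gauss(skill, 0.2)))
        trajectories.append(Trajectory(task, [f"act on {task}"], reward))
    return trajectories


def filter_trajectories(trajectories, threshold=0.5):
    """Keep only trajectories whose reward clears the quality threshold."""
    return [t for t in trajectories if t.reward >= threshold]


def train(skill, kept):
    """Placeholder for fine-tuning: nudge skill up per kept trajectory."""
    return min(1.0, skill + 0.01 * len(kept))


def flywheel(rounds=3, seed=0):
    """Run the loop and return the skill value after each round."""
    rng = random.Random(seed)
    skill = 0.4
    history = [skill]
    tasks = [f"task-{i}" for i in range(10)]
    for _ in range(rounds):
        trajectories = collect_trajectories(skill, tasks, rng)
        kept = filter_trajectories(trajectories)
        skill = train(skill, kept)
        history.append(skill)
    return history


print(flywheel())
```

The filtering step is where the loop stands or falls: without a reliable quality signal, low-quality trajectories would be fed back into training.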

Architecture

The 'Virtuous Circle' of Agentic LLMs, illustrating how agent interaction feeds back into model training

Breakthrough Assessment

7/10

A comprehensive and timely survey that synthesizes the shift from static LLMs to active agents, offering a clear taxonomy and a compelling argument for agents as data generators.

⚙️ Technical Details

Problem Definition

Setting: Survey and taxonomy construction for the field of Agentic Large Language Models

Inputs: Current literature (mostly 2023-2025) across NLP, robotics, and multi-agent systems

Outputs: Unified taxonomy, definitions, and research agenda

Limitations

Safety and liability risks: Agents taking autonomous actions in the real world (e.g., financial trading, medical advice) pose unsolved legal and safety challenges
Evaluation difficulty: Assessing open-ended agentic behavior and social simulations is harder than static text benchmarks
Dependence on underlying LLM quality: Agent performance is strictly capped by the reasoning capabilities of the base model
Survey scope: Heavily focused on very recent work (2024-2025), reflecting the immaturity and rapid flux of the field

Reproducibility

No replication artifacts are mentioned in the paper, as expected for a survey.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and their training pipeline (Pretraining, SFT, RLHF)
Reinforcement Learning (RL) concepts
Multi-Agent Systems (MAS)

Key Terms

Agentic LLMs: LLMs that (1) reason, (2) act, and (3) interact, possessing a degree of autonomy to achieve goals

In-context learning: The ability of a model to learn from examples provided in the prompt at inference time without updating weights

VLA: Vision-Language-Action models—models that map visual and language inputs to actions, with weights updated according to robotic action-feedback sequences

CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps

SFT: Supervised Fine-Tuning—training a model on labeled examples to specialize it for a task

RLHF: Reinforcement Learning from Human Feedback—aligning models to human preferences using reward signals

Hallucination: When LLMs generate answers that are factually incorrect or ungrounded

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that trains small low-rank update matrices while the base model weights stay frozen

DPO: Direct Preference Optimization—an alignment method optimizing preferences without a separate reward model
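To make the DPO entry concrete, the sketch below computes the per-example DPO loss as it is commonly written in the literature, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]). The summary itself does not give this formula; the function name, argument names, and the default `beta=0.1` are assumptions for illustration, and the inputs are plain floats rather than real model outputs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference model
    beta       : strength of the implicit KL constraint (illustrative default)
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the chosen response relative to the rejected one lowers the loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

No reward model appears anywhere in the computation: the preference signal enters only through which response is labeled chosen and which rejected.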