← Back to Paper List

The landscape of agentic reinforcement learning for LLMs: A survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, ..., Philip Torr, Lei Bai
University of Oxford, Shanghai AI Laboratory, National University of Singapore
arXiv, 9/2025 (2025)
Agent RL Reasoning Benchmark

📝 Paper Summary

Agentic AI Reinforcement Learning for LLMs
Agentic Reinforcement Learning reframes LLMs from static text generators optimized on single-turn preferences into autonomous agents optimized via RL to plan, use tools, and reason in dynamic, partially observable environments.
Core Problem
Current Preference-Based Reinforcement Fine-Tuning (PBRFT) treats LLMs as passive text emitters in degenerate single-step environments, failing to capture the long-horizon decision-making, tool use, and stateful memory required for autonomous agents.
Why it matters:
  • Static alignment (RLHF/DPO) overlooks sequential decision-making crucial for realistic tasks like coding or web navigation
  • Current studies examine isolated capabilities (e.g., just tool use) without a unified framework connecting them to RL optimization
  • Inconsistent terminology and protocols across 500+ works make it difficult to compare progress in building general-purpose agents
Concrete Example: In PBRFT, a model is optimized to output a single correct text response to a prompt. In Agentic RL, an agent must issue a 'search' action, observe the result, update its internal state, and then decide to 'summarize' or 'search again'—a multi-step process where rewards are often delayed until the final goal is achieved.
Key Novelty
Formal Unification of Agentic RL
  • Formalizes the shift from PBRFT (single-step MDP) to Agentic RL (Partially Observable MDP) where actions include both text and environmental interactions
  • Proposes a twofold taxonomy organizing the field by core capabilities (planning, tool use, memory) and downstream applications
  • Consolidates over 500 works to distinguish how RL transforms static heuristic modules into adaptive, robust agentic behaviors
Architecture
Architecture Figure Figure 2
A conceptual comparison between LLM RL (PBRFT) and Agentic RL frameworks
Evaluation Highlights
  • Synthesizes over 500 recent works covering planning, tool use, memory, and reasoning
  • Formalizes the transition from degenerate single-step MDPs (T=1) in PBRFT to long-horizon POMDPs in Agentic RL
  • Categorizes RL algorithms into four families (REINFORCE, PPO, DPO, GRPO) and analyzes their specific utility for agentic tasks
Breakthrough Assessment
9/10
This is a foundational survey that defines and formalizes the emerging field of Agentic RL. It provides the necessary theoretical grounding (MDP vs. POMDP) to distinguish agents from standard LLMs, likely becoming a standard reference.
×