The landscape of agentic reinforcement learning for LLMs: A survey

📝 Paper Summary

Agentic AI, Reinforcement Learning for LLMs

Agentic Reinforcement Learning reframes LLMs from static text generators optimized on single-turn preferences into autonomous agents optimized via RL to plan, use tools, and reason in dynamic, partially observable environments.

Core Problem

Current Preference-Based Reinforcement Fine-Tuning (PBRFT) treats LLMs as passive text emitters in degenerate single-step environments, failing to capture the long-horizon decision-making, tool use, and stateful memory required for autonomous agents.

Why it matters:

Static alignment (RLHF/DPO) overlooks sequential decision-making crucial for realistic tasks like coding or web navigation
Current studies examine isolated capabilities (e.g., just tool use) without a unified framework connecting them to RL optimization
Inconsistent terminology and protocols across 500+ works make it difficult to compare progress in building general-purpose agents

Concrete Example: In PBRFT, a model is optimized to output a single correct text response to a prompt. In Agentic RL, an agent must issue a 'search' action, observe the result, update its internal state, and then decide to 'summarize' or 'search again'—a multi-step process where rewards are often delayed until the final goal is achieved.
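The multi-step loop in this example can be sketched in a few lines. This is only an illustration under stated assumptions: `ToySearchEnv`, `toy_policy`, and `run_episode` are hypothetical names, the environment is a stand-in that reveals a relevant result after two searches, and a hand-written rule replaces the LLM policy.

```python
from dataclasses import dataclass


@dataclass
class ToySearchEnv:
    """Hypothetical environment: a relevant result only appears after two searches."""
    searches_needed: int = 2
    searches_done: int = 0

    def step(self, action: str) -> tuple[str, float, bool]:
        """Return (observation, reward, done) for one agent action."""
        if action == "search":
            self.searches_done += 1
            found = self.searches_done >= self.searches_needed
            return ("relevant result" if found else "irrelevant result"), 0.0, False
        if action == "summarize":
            # Delayed reward: nothing is paid out until the episode ends.
            success = self.searches_done >= self.searches_needed
            return "episode over", (1.0 if success else 0.0), True
        raise ValueError(f"unknown action: {action}")


def toy_policy(memory: list[str]) -> str:
    """Stand-in for the LLM policy: keep searching until a relevant result is remembered."""
    return "summarize" if "relevant result" in memory else "search"


def run_episode(env: ToySearchEnv, max_steps: int = 10) -> tuple[list[str], float]:
    memory: list[str] = []  # internal state built from past observations
    actions: list[str] = []
    total_reward = 0.0
    for _ in range(max_steps):
        action = toy_policy(memory)
        observation, reward, done = env.step(action)
        memory.append(observation)  # update internal state
        actions.append(action)
        total_reward += reward
        if done:
            break
    return actions, total_reward


actions, total_reward = run_episode(ToySearchEnv())
print(actions, total_reward)  # ['search', 'search', 'summarize'] 1.0
```

Every intermediate step earns zero reward here, so a learner only gets a signal at the end of the trajectory. That is the credit-assignment problem the single-turn setting never faces.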

Key Novelty

Formal Unification of Agentic RL

Formalizes the shift from PBRFT (single-step MDP) to Agentic RL (Partially Observable MDP) where actions include both text and environmental interactions
Proposes a twofold taxonomy organizing the field by core capabilities (planning, tool use, memory) and downstream applications
Consolidates over 500 works to distinguish how RL transforms static heuristic modules into adaptive, robust agentic behaviors

Architecture

A conceptual comparison between LLM RL (PBRFT) and Agentic RL frameworks

Evaluation Highlights

Synthesizes over 500 recent works covering planning, tool use, memory, and reasoning
Formalizes the transition from degenerate single-step MDPs (T=1) in PBRFT to long-horizon POMDPs in Agentic RL
Categorizes RL algorithms into four families (REINFORCE, PPO, DPO, GRPO) and analyzes their specific utility for agentic tasks

Breakthrough Assessment

9/10

This is a foundational survey that defines and formalizes the emerging field of Agentic RL. It provides the theoretical grounding (MDP vs. POMDP) needed to distinguish agents from standard LLMs, and it is likely to become a standard reference.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP)

Inputs: Initial state s0 (prompt) and subsequent observations o_t based on environment state s_t

Outputs: Action a_t, which can be text (A_text) or tool/environment interaction (A_action)
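In standard POMDP notation, the setting above might be written as follows. The symbols $P$ (transition dynamics), $\mathcal{O}$ (observation function), and $\pi_\theta$ (policy), as well as conditioning the policy on the observation history $o_{\le t}$, are assumptions of this sketch and not notation confirmed from the paper.

```latex
\begin{aligned}
s_{t+1} &\sim P(\cdot \mid s_t, a_t), \\
o_t &= \mathcal{O}(s_t), \\
a_t &\sim \pi_\theta(\cdot \mid o_{\le t}), \qquad
a_t \in \mathcal{A}_{\text{text}} \cup \mathcal{A}_{\text{action}}.
\end{aligned}
```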

Pipeline Flow

Survey Framework: Conceptualization -> Capabilities -> Applications -> Ecosystem

System Modules

Formalization Module

Defines Agentic RL as a POMDP and contrasts it with PBRFT's degenerate MDP

Model or implementation: Mathematical Abstraction

Capability Taxonomy

Categorizes research by agentic modules

Model or implementation: Conceptual Framework

Novel Architectural Elements

Unified Action Space Definition: Formally modeling the output space as the union of text tokens (A_text) and functional actions (A_action) within the same RL policy
Taxonomy structure: Splitting the field into Capability-centered vs. Application-centered views
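One way to picture the unified action space is as a tagged union that a single policy emits. The sketch below is illustrative only: the class names, the `CALL tool key=value` text convention, and `parse_action` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TextAction:
    """Member of A_text: free-form text emitted by the model."""
    text: str


@dataclass
class ToolAction:
    """Member of A_action: a structured call that queries or changes the environment."""
    tool: str
    arguments: dict


# The policy's output space is the union of A_text and A_action.
AgentAction = Union[TextAction, ToolAction]


def parse_action(raw: str) -> AgentAction:
    """Hypothetical convention: 'CALL <tool> key=value ...' is a tool call, all else is text."""
    parts = raw.strip().split()
    if len(parts) >= 2 and parts[0] == "CALL":
        arguments = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
        return ToolAction(tool=parts[1], arguments=arguments)
    return TextAction(text=raw)


print(parse_action("CALL search query=llm"))
print(parse_action("The answer is 4."))
```

Because both kinds of action come from the same policy, one RL objective can assign credit to what the model says and to which tool it invokes.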

Modeling

Base Model: N/A (survey paper; covers models such as OpenAI o1 and DeepSeek-R1)

Training Method: Survey of methods (PPO, GRPO, REINFORCE, DPO)

Objective Functions:

Purpose: PBRFT Objective.
Formally: Maximize the expected reward of a single-turn response, r(s_0, a_0), with no discount factor.

Purpose: Agentic RL Objective.
Formally: Maximize the expected discounted cumulative reward, the sum over t of gamma^t * r(s_t, a_t), over horizon T.

Purpose: REINFORCE.
Formally: Gradient ascent on the log-probability of each action, weighted by the return G_t.

Purpose: PPO.
Formally: Clipped surrogate objective that keeps each policy update close to the previous policy, approximating a trust region.

Purpose: GRPO.
Formally: Advantage estimation via group-relative rewards, A_i = (r_i - mean(r)) / std(r), without a critic network.
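The quantities named in these objectives can be computed on toy numbers. The following is a minimal sketch and not the paper's implementation: the function names are hypothetical, the clip range `epsilon=0.2` is a commonly used default assumed here, and the PPO function returns the per-sample surrogate term only.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over a finite trajectory (the agentic RL objective, per episode)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def returns_to_go(rewards: list[float], gamma: float) -> list[float]:
    """G_t for every step t, the weight REINFORCE puts on each log-probability."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def ppo_clipped_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Single-turn case: one reward, so the discount factor has no effect.
print(discounted_return([1.0], gamma=0.9))            # 1.0
# Sparse, delayed reward over three steps.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
print(returns_to_go([0.0, 0.0, 1.0], gamma=0.5))      # [0.25, 0.5, 1.0]
# A large probability ratio is capped at 1 + epsilon when the advantage is positive.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))
```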

Adaptation: Survey covers various adaptation methods (LoRA, Full Fine-Tuning)

Trainable Parameters: Variable across surveyed papers

Training Data:

Dynamic environments (WebArena, OSWorld)
Static preference datasets (for PBRFT contrast)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Luo et al.: This paper focuses specifically on the Reinforcement Learning mechanism as the driver of agentic capability, rather than general architectural patterns
vs. Cao et al.: Distinguishes 'LLMs for RL' (LLM as helper) from 'Agentic RL' (LLM as the policy itself being optimized)
vs. Plaat et al.: Provides a formal mathematical grounding (MDP vs POMDP) to distinguish the paradigms

Limitations

Survey nature limits experimental validation of new claims; relies on synthesizing existing work
Scope is vast (500+ papers), potentially sacrificing depth on specific niche algorithms
Because the field is evolving rapidly, some of the cited state-of-the-art systems (e.g., DeepSeek-R1) are very recent, and their long-term impact is not yet settled

Reproducibility

Code: https://github.com/OpenRL/Agentic-RL-Survey

The paper maintains a consolidated list of open-source environments and frameworks in this repository to support reproducibility in the field.

📊 Experiments & Results

Evaluation Setup

Theoretical Analysis and Literature Review

Benchmarks:

WebArena (Web Navigation)
OSWorld (Operating System Control)
SWE-bench (Software Engineering)

Metrics:

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Agentic RL fundamentally differs from standard LLM RL (PBRFT) by introducing partial observability, temporal extension (T > 1), and dynamic state transitions
The field is moving from static datasets to dynamic environments where rewards are sparse and delayed
GRPO is emerging as a critical algorithm for reasoning models (e.g., DeepSeek-R1) because it eliminates the value network, which makes training more efficient
A unified action space combining text generation and functional tool invocation is essential for effective agentic policies
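To make the GRPO takeaway concrete, the group-relative advantage can be computed directly from the rewards of several sampled outputs for the same prompt. This is a minimal sketch under assumptions: the function name is hypothetical, the population standard deviation is used, and `eps` is a small stabilizer added here to avoid division by zero; published implementations may differ on both choices.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (r_i - mean(r)) / (std(r) + eps), with no learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct outputs get about +1, incorrect ones about -1
```

The baseline is just the group mean, so no separate value network has to be trained or stored. If every output in the group receives the same reward, all advantages are zero and that prompt contributes no learning signal.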

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning fundamentals (MDPs, Policies, Rewards)
Large Language Model post-training (SFT, RLHF)
Basic understanding of LLM agents (Tool use, Planning)

Key Terms

PBRFT: Preference-Based Reinforcement Fine-Tuning—optimizing LLMs on fixed static datasets to align with human preferences (e.g., RLHF, DPO)

Agentic RL: Reinforcement Learning applied to LLMs acting as autonomous agents in dynamic environments, optimizing for long-term task completion rather than just single-turn text quality

POMDP: Partially Observable Markov Decision Process—a mathematical framework where an agent makes decisions based on incomplete observations of the world state

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs to each other, eliminating the need for a separate value network (critic)

DPO: Direct Preference Optimization—a method optimizing the policy directly on preference data without an explicit reward model

PPO: Proximal Policy Optimization—an on-policy RL algorithm that constrains updates to ensure stability

SFT: Supervised Fine-Tuning—training models on labeled examples

RAG: Retrieval-Augmented Generation—enhancing LLM inputs with external data

degenerate MDP: An MDP with time horizon T=1, which effectively reduces the problem to a contextual bandit or a single-step supervised learning task
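The reduction can be seen by setting T=1 in the discounted objective. The symbol $J$ and the indexing $t = 0, \dots, T-1$ are assumptions of this sketch.

```latex
J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t) \right]
  \;\xrightarrow{\;T = 1\;}\;
  \mathbb{E}_{a_0 \sim \pi_\theta(\cdot \mid s_0)}\!\left[ r(s_0, a_0) \right]
```

Only the $t = 0$ term survives and $\gamma^{0} = 1$, so the discount factor and the transition dynamics drop out, leaving the single-turn PBRFT objective.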