AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework

📝 Paper Summary

Online Reinforcement Learning for LLM Agents
Multi-Turn Agent Interactions
Multi-Task Learning

AgentRL provides a scalable infrastructure and stable algorithms for training LLM agents across multiple turns and tasks by decoupling generation from training and normalizing rewards.

Core Problem

Training LLM agents with RL in multi-turn, multi-task settings suffers from poor exploration, unstable optimization due to varying reward scales, and inefficient synchronous data collection.

Why it matters:

Existing RL methods like PPO struggle with the sparse rewards and long horizons typical of agentic tasks
Synchronous training pipelines leave GPUs idle while waiting for slow environment interactions, severely limiting throughput
Multi-task training often fails because easier tasks with higher rewards dominate the gradient updates, causing the model to ignore harder tasks

Concrete Example: In a web shopping task, an agent might need 10+ steps to check out. Standard PPO might never explore the final 'purchase' action because it over-exploits the early, easy steps. Meanwhile, if the agent is trained jointly with a simple search task, the high rewards from search overwhelm the learning signal from the more complex shopping task.

Key Novelty

Asynchronous Generation-Training Pipeline with Cross-Policy Sampling

Decouples inference (rollout) from learning (update) into separate asynchronous processes, maximizing GPU utilization unlike standard synchronous PPO
Introduces cross-policy sampling: instead of sampling only from the current policy, it mixes samples from a pool of historical and external policies to improve exploration in sparse-reward settings
Applies task advantage normalization to balance learning updates across different tasks with varying reward scales
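
As a rough illustration of the cross-policy sampling idea described above, the following sketch picks the acting policy independently at every turn, using either the current policy or a member of a policy pool. The function name, the `mix_prob` parameter, and the per-turn (rather than per-episode) mixing granularity are assumptions made for this example; the summary does not specify how AgentRL mixes policies.

```python
import random
from typing import Callable, List, Optional, Tuple

# A "policy" here is any callable mapping an observation string to an action string.
Policy = Callable[[str], str]

def cross_policy_rollout(
    current: Policy,
    pool: List[Policy],
    env_step: Callable[[str], Tuple[str, float, bool]],
    first_obs: str,
    mix_prob: float = 0.3,   # hypothetical: chance of acting with a pool policy at each turn
    max_turns: int = 10,
    rng: Optional[random.Random] = None,
) -> List[dict]:
    """Collect one trajectory, choosing the acting policy independently at every turn."""
    rng = rng or random.Random()
    obs, trajectory = first_obs, []
    for _ in range(max_turns):
        use_pool = bool(pool) and rng.random() < mix_prob
        actor = rng.choice(pool) if use_pool else current
        action = actor(obs)
        next_obs, reward, done = env_step(action)
        # Record the source of each action so an off-policy correction can use it later.
        trajectory.append(
            {"obs": obs, "action": action, "reward": reward, "from_pool": use_pool}
        )
        obs = next_obs
        if done:
            break
    return trajectory
```

Tagging each step with its source policy matters because pool-generated actions are off-policy with respect to the model being trained.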

Architecture

The AgentRL framework architecture showing the decoupled Actor-Learner structure.

Evaluation Highlights

Outperforms GPT-4o on WebShop (success rate) using Llama-3-8B-Instruct trained with AgentRL
+13.3 percentage points in success rate on OSWorld compared to the PPO baseline (4.1 to 17.4)
Multi-task training matches the performance of task-specific experts (within margin of error) while using a single shared model

Breakthrough Assessment

8/10

Significant contribution to infrastructure and stability for online RL, addressing the key bottleneck of throughput in agent training. Adopted by AutoGLM.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn, multi-task Reinforcement Learning (RL) for LLM agents

Inputs: Task description and current observation o_t

Outputs: Action a_t (text or function call)

Pipeline Flow

Environment Interaction (Rollout Workers)
Trajectory Storage (Replay Buffer)
Model Update (Learner)
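
The three stages above can be sketched with standard-library threads and a queue. This is a toy stand-in, not the AgentRL implementation: trajectories are placeholder dictionaries, a "gradient update" is just a version counter increment, and all function names are hypothetical.

```python
import queue
import threading

def rollout_worker(worker_id, buffer, num_episodes, policy_version):
    """Generate placeholder trajectories and push them without waiting for the learner."""
    for episode in range(num_episodes):
        buffer.put({
            "worker": worker_id,
            "episode": episode,
            # Remember which policy version produced the data, so the learner
            # could apply an off-policy correction.
            "behavior_version": policy_version["value"],
        })

def learner(buffer, total, batch_size, policy_version):
    """Consume trajectories in batches; each batch stands in for one update."""
    consumed, updates = 0, 0
    while consumed < total:
        batch = [buffer.get() for _ in range(min(batch_size, total - consumed))]
        consumed += len(batch)
        policy_version["value"] += 1  # stand-in for an optimizer step and weight sync
        updates += 1
    return updates

def run_pipeline(num_workers=4, episodes_per_worker=8, batch_size=8):
    buffer = queue.Queue(maxsize=64)  # the replay buffer between the two stages
    policy_version = {"value": 0}
    workers = [
        threading.Thread(
            target=rollout_worker,
            args=(i, buffer, episodes_per_worker, policy_version),
        )
        for i in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    # The learner runs concurrently with the workers instead of after them.
    updates = learner(buffer, num_workers * episodes_per_worker, batch_size, policy_version)
    for worker in workers:
        worker.join()
    return updates, policy_version["value"]
```

Because workers never block on the learner (only on a full buffer), slow environment steps in one worker do not stall training on data already collected by the others.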

System Modules

Actor (Rollout Worker)

Interacts with environments to generate trajectory data using the current or diverse policies

Model or implementation: Llama-3-8B-Instruct (or similar open LLMs)

Replay Buffer

Stores trajectories (observations, actions, rewards) asynchronously received from actors

Model or implementation: In-memory queue / database

Learner

Updates the policy and value networks using collected data

Model or implementation: Llama-3-8B-Instruct (policy) + Value Head

Novel Architectural Elements

Fully asynchronous generation-training pipeline tailored for LLM agents (distinct from standard synchronous PPO implementations like TRL)
Unified function-call based API interface for heterogeneous environments (web, OS, database)
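
One way to picture a unified function-call interface is an abstract base class that every environment implements, so the agent always sees a list of callable functions regardless of the domain. The class and method names below are hypothetical and chosen for illustration; they are not taken from the AgentRL codebase.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool

class FunctionCallEnv(ABC):
    """Hypothetical common interface: every environment exposes named functions."""

    @abstractmethod
    def reset(self) -> str: ...

    @abstractmethod
    def functions(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def call(self, name: str, arguments: Dict[str, Any]) -> StepResult: ...

class ToyShopEnv(FunctionCallEnv):
    """Toy stand-in for a web-shopping task with two callable functions."""

    def reset(self) -> str:
        self.cart: List[str] = []
        return "You are on the shop home page."

    def functions(self) -> List[Dict[str, Any]]:
        return [
            {"name": "add_to_cart", "parameters": {"item": "string"}},
            {"name": "checkout", "parameters": {}},
        ]

    def call(self, name: str, arguments: Dict[str, Any]) -> StepResult:
        if name == "add_to_cart":
            self.cart.append(arguments["item"])
            return StepResult(f"Cart: {self.cart}", 0.0, False)
        if name == "checkout":
            # Sparse reward: only checking out with a non-empty cart counts as success.
            message = "Order placed." if self.cart else "Cart is empty."
            return StepResult(message, float(bool(self.cart)), True)
        return StepResult(f"Unknown function: {name}", 0.0, False)
```

A web, OS, or database environment would differ only in the functions it advertises and in how `call` executes them, which keeps the rollout code identical across tasks.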

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: Asynchronous PPO with V-Trace correction
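
For readers unfamiliar with V-Trace, here is a minimal pure-Python sketch of the correction as defined in the IMPALA formulation: importance ratios between the target and behavior policies are clipped at `rho_bar` and `c_bar`, then used to build corrected value targets and policy-gradient advantages. How AgentRL combines this with the PPO objective is not detailed in this summary, so the function below is a generic illustration with assumed names and defaults.

```python
import math
from typing import List, Tuple

def vtrace(
    target_logp: List[float],    # log pi(a_t | x_t) under the policy being trained
    behavior_logp: List[float],  # log mu(a_t | x_t) under the policy that acted
    rewards: List[float],
    values: List[float],         # V(x_t) for each step
    bootstrap_value: float,      # V of the state after the last step
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Return (value targets v_t, policy-gradient advantages)."""
    ratios = [math.exp(t - b) for t, b in zip(target_logp, behavior_logp)]
    rhos = [min(rho_bar, r) for r in ratios]
    cs = [min(c_bar, r) for r in ratios]
    next_values = values[1:] + [bootstrap_value]
    deltas = [
        rho * (r + gamma * nv - v)
        for rho, r, nv, v in zip(rhos, rewards, next_values, values)
    ]
    # Backward recursion: (v_t - V_t) = delta_t + gamma * c_t * (v_{t+1} - V_{t+1})
    corrections = [0.0] * len(values)
    acc = 0.0
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    vs = [v + c for v, c in zip(values, corrections)]
    next_vs = vs[1:] + [bootstrap_value]
    advantages = [
        rho * (r + gamma * nvs - v)
        for rho, r, nvs, v in zip(rhos, rewards, next_vs, values)
    ]
    return vs, advantages
```

When the behavior and target policies coincide, every ratio is 1 and the targets reduce to ordinary discounted returns; when the target policy assigns an action lower probability than the behavior policy did, that step's contribution is scaled down.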

Objective Functions:

Purpose: Maximize expected reward while staying close to the behavior policy.

Formally: V-Trace corrected policy gradient objective.

Purpose: Normalize advantages across different tasks to ensure balanced learning.

Formally: A_hat = (A - mean(A_task)) / (std(A_task) + epsilon)

Purpose: Encourage exploration by sampling from diverse policies.

Formally: Sampling distribution P_sample is a mixture of current policy pi and pool policies mu.
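
The task advantage normalization formula above can be sketched in a few lines: advantages are grouped by task, and each is standardized with its own task's mean and standard deviation. The function name is illustrative, and the use of the population standard deviation and the value of `eps` are assumptions, since the summary gives only the general form.

```python
from collections import defaultdict
from statistics import fmean, pstdev
from typing import Hashable, List

def normalize_advantages_per_task(
    advantages: List[float],
    task_ids: List[Hashable],
    eps: float = 1e-8,  # assumed small constant to avoid division by zero
) -> List[float]:
    """Apply A_hat = (A - mean(A_task)) / (std(A_task) + eps) within each task."""
    by_task = defaultdict(list)
    for adv, task in zip(advantages, task_ids):
        by_task[task].append(adv)
    stats = {task: (fmean(vals), pstdev(vals)) for task, vals in by_task.items()}
    return [
        (adv - stats[task][0]) / (stats[task][1] + eps)
        for adv, task in zip(advantages, task_ids)
    ]
```

For instance, raw advantages of 10 and 30 on one task and 0.1 and 0.3 on another both map to roughly -1 and +1, so neither task dominates the gradient purely because of its reward scale.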

Adaptation: Full fine-tuning (assumed, as LoRA not explicitly detailed as sole method)

Trainable Parameters: Full model (implied by magnitude of results and infrastructure description)

Training Data:

WebShop
SciWorld
TextCraft
OSWorld
InterCode-SQL

Key Hyperparameters:

learning_rate: 1e-6 to 5e-6
batch_size: 128 (global)
ppo_clip_epsilon: 0.1
gae_lambda: 1.0
gae_gamma: 0.99
kl_coef: 0.01
max_grad_norm: 1.0
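
To make the `gae_lambda` and `gae_gamma` entries concrete, the following is a minimal sketch of generalized advantage estimation (GAE) over a single trajectory. With `lam = 1.0`, as listed above, the estimate reduces to the discounted return minus the value baseline. The function name and signature are illustrative, not taken from the paper's code.

```python
from typing import List

def gae(
    rewards: List[float],
    values: List[float],           # V(s_t) for each step of the trajectory
    bootstrap_value: float = 0.0,  # V of the state after the final step (0 if terminal)
    gamma: float = 0.99,
    lam: float = 1.0,
) -> List[float]:
    """Compute GAE advantages with a backward pass over one trajectory."""
    advantages = [0.0] * len(rewards)
    next_value, acc = bootstrap_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        acc = delta + gamma * lam * acc
        advantages[t] = acc
        next_value = values[t]
    return advantages
```

Setting `lam = 1.0` favors low bias at the cost of higher variance, which is a common choice when rewards arrive only at the end of a long multi-turn episode.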

Compute: Experiments run on clusters with 8 to 64 A100/H100 GPUs depending on scale

Comparison to Prior Work

vs. Standard PPO: AgentRL uses asynchronous execution + V-Trace off-policy correction to handle latency, whereas PPO is strictly synchronous/on-policy
vs. IMPALA: Adapts the async architecture specifically for LLMs (handling heavy inference/generation costs) and adds task advantage normalization
vs. RFT/Expert Iteration: AgentRL focuses on online RL with exploration via cross-policy sampling, rather than just learning from best-of-n samples
vs. Arcee [not cited in paper]: Arcee focuses on merging models, whereas AgentRL focuses on the training infrastructure itself

Limitations

Requires significant computational resources (multiple high-end GPUs) to realize the benefits of the asynchronous pipeline
Complexity of managing distributed infrastructure (actors, learners, controller) is higher than simple training scripts
Cross-policy sampling adds memory overhead to maintain the policy pool

Reproducibility

Code: https://github.com/THUDM/AgentRL

The code is open source, and the framework supports containerized environment development. Model weights for the reported Llama-3-8B results are not explicitly linked in the text; they may be available via the repository.

📊 Experiments & Results

Evaluation Setup

Agentic tasks across web browsing, OS control, coding, and text games.

Benchmarks:

WebShop (Web e-commerce interaction)
SciWorld (Science experiment simulation)
OSWorld (Operating System control, GUI/CLI)
TextCraft (Minecraft-style crafting, text)
InterCode-SQL (SQL database interaction)

Metrics:

Success Rate (SR)
Score (Reward)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on single-task performance, showing AgentRL improvements over baselines:

Benchmark	Metric	Baseline	This Paper	Δ
WebShop	Score	44.2	62.4	+18.2
SciWorld	Score	38.1	59.2	+21.1
OSWorld	Success Rate	4.1	17.4	+13.3

Comparison against proprietary models:

Benchmark	Metric	Baseline	This Paper	Δ
WebShop	Success Rate	58.2	62.4	+4.2
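
The Δ column reports absolute differences between the two score columns. The short sketch below, which uses only numbers from the table above, recomputes those differences and also shows the relative change for comparison; the function name is illustrative.

```python
from typing import Tuple

def absolute_and_relative(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute difference in points, relative change in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative

# (benchmark, metric, baseline, this paper), copied from the results table.
RESULTS = [
    ("WebShop", "Score", 44.2, 62.4),
    ("SciWorld", "Score", 38.1, 59.2),
    ("OSWorld", "Success Rate", 4.1, 17.4),
    ("WebShop vs. proprietary", "Success Rate", 58.2, 62.4),
]

for name, metric, base, ours in RESULTS:
    abs_delta, rel = absolute_and_relative(base, ours)
    print(f"{name} ({metric}): {abs_delta:+.1f} absolute, {rel:+.1f}% relative")
```

The distinction matters most for OSWorld, where a small absolute gain from a very low baseline corresponds to a large relative change.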

Experiment Figures

Ablation study of different components (Cross-policy sampling, Task normalization) on performance.

Main Takeaways

AgentRL consistently outperforms standard PPO baselines across diverse domains (Web, OS, Science, Coding)
Multi-task training with AgentRL achieves performance comparable to training separate expert models for each task, indicating effective interference mitigation via task advantage normalization
The asynchronous architecture significantly improves training throughput compared to synchronous baselines
Cross-policy sampling is crucial for exploration in sparse reward environments where standard exploration fails

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO algorithm)
LLM Agent Architectures
Distributed System Design (Asynchronous processing)

Key Terms

PPO: Proximal Policy Optimization, an RL algorithm that limits the size of each policy update with a clipped surrogate objective (an approximation of a trust-region constraint) to ensure stability

Cross-Policy Sampling: A strategy where actions are sampled not just from the current policy but from a diverse pool of policies (including older versions or different models) to encourage exploration

Task Advantage Normalization: Normalizing the advantage values (how much better an action is than expected) specifically within each task's statistics to prevent tasks with large raw rewards from dominating the gradient

Asynchronous Pipeline: A system design where data generation (rollout) and model training happen in parallel processes connected by a buffer, rather than waiting for each other

Advantage: In RL, a value measuring how much better a specific action is compared to the average action in that state

On-policy: RL algorithms that strictly require data generated by the current version of the model being trained

Off-policy: RL algorithms that can learn from data generated by older or different policies

V-Trace: A correction method used in off-policy RL to adjust for the difference between the behavior policy (that generated data) and the target policy (being learned)

AutoGLM: A foundation agent framework mentioned as utilizing the AgentRL system

Containerized Environment: Running agent tasks (like web browsing) inside isolated Docker containers to ensure safety and reproducibility