WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

📝 Paper Summary

Web agents RL-based

WebAgent-R1 trains web agents via end-to-end reinforcement learning directly from online interactions, using dynamic context compression and parallel rollouts to manage long horizons and sparse rewards.

Core Problem

Training effective web agents is challenging because web tasks involve long-horizon decision-making in dynamic environments where early actions (like logging out) irreversibly change the state, making offline data unreliable.

Why it matters:

Existing RL for LLMs (e.g., for math reasoning) focuses on single-turn tasks and does not address the complexity of multi-step web interactions
Prior web agents rely on prompting or behavior cloning, which lack exploration capabilities, or off-policy RL, which suffers from mismatch between the training data and current policy
Complex dependencies (e.g., needing to log in before editing a profile) require agents to learn adaptively from their own current actions rather than static datasets

Concrete Example: Suppose an agent must log out and then edit a profile. The two steps are interdependent: once logged out, the agent can no longer reach the profile page. An agent trained on off-policy data in which it never logged out may still try to open the profile afterwards and fail, because it has never observed the loss of access. End-to-end RL lets the agent experience this failure and adjust.

Key Novelty

Multi-turn Group Relative Policy Optimization (M-GRPO) with Dynamic Context Compression

Extends GRPO to multi-turn settings by generating groups of parallel trajectories online and updating the policy based on binary task success rewards
Implements dynamic context compression that simplifies historical observations (e.g., replacing old HTML with a placeholder) to fit long interaction histories into memory
Uses parallel trajectory rollouts across multiple independent browser instances to efficiently gather diverse experience data for the group-based updates
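
The group-based update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout` is a hypothetical stand-in for one browser episode (a seeded coin flip here), and the normalization by the group mean and standard deviation follows the usual GRPO recipe, which is assumed to carry over.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def rollout(seed: int) -> float:
    """Hypothetical stand-in for one browser episode; returns a binary reward."""
    rng = random.Random(seed)
    return 1.0 if rng.random() < 0.5 else 0.0


def collect_group(seeds: list[int]) -> list[float]:
    """Run one rollout per seed in parallel; map() keeps results in seed order."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        return list(pool.map(rollout, seeds))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rewards = collect_group([0, 1, 2, 3])
    print(rewards, group_relative_advantages(rewards))
```

Note that when every trajectory in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason diverse rollouts matter under sparse binary rewards.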

Architecture

The end-to-end multi-turn RL framework. It shows the flow from Environment -> SFT Policy -> M-GRPO with two key mechanisms: Dynamic Context Compression and Parallel Trajectory Rollout.

Evaluation Highlights

Boosts Qwen-2.5-3B success rate from 6.1% (base) to 33.9% on WebArena-Lite, significantly outperforming the Behavior Cloning baseline (20.0%)
Llama-3.1-8B improves from 8.5% (base) to 44.8% with WebAgent-R1, surpassing OpenAI o3 (39.4%) and GPT-4o (16.4%)
Demonstrates effective test-time scaling: increasing interaction turns consistently improves success rates across prompting, SFT, and RL methods

Breakthrough Assessment

8/10

Significant performance jump over strong proprietary models (like o3) using much smaller open models. Successfully applies on-policy RL to complex, long-horizon web tasks, addressing key memory and stability challenges.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP)

Inputs: Current state s_t (text-only HTML content) and interaction history h_t

Outputs: Action a_t from predefined action space (e.g., click, type)

Pipeline Flow

Environment Observation (HTML)
Dynamic Context Compression
Policy Model (Action Generation)
Environment Execution
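
The four stages above can be sketched as a single interaction loop. Everything here is a toy assumption for illustration: `ToyEnv`, `toy_policy`, the `"<omitted>"` placeholder, and the two-step success condition are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical browser stand-in: the task succeeds after 'click' then 'type'."""
    done_actions: list = field(default_factory=list)

    def observe(self) -> str:
        return f"<html><body>step {len(self.done_actions)}</body></html>"

    def step(self, action: str) -> bool:
        self.done_actions.append(action)
        return self.done_actions[-2:] == ["click", "type"]


def toy_policy(context: list) -> str:
    """Hypothetical policy: alternate click/type based on the number of past actions."""
    n_actions = sum(1 for role, _ in context if role == "action")
    return "click" if n_actions % 2 == 0 else "type"


def run_episode(env, policy, max_turns: int = 5):
    """Return (binary reward, final context) for one episode."""
    context = []
    for _ in range(max_turns):
        # Stage 2: compress every earlier observation before adding the new one.
        context = [(role, "<omitted>" if role == "obs" else text) for role, text in context]
        # Stage 1: read the current page.
        context.append(("obs", env.observe()))
        # Stage 3: the policy picks the next action from the compressed context.
        action = policy(context)
        context.append(("action", action))
        # Stage 4: execute the action; stop on task success.
        if env.step(action):
            return 1.0, context
    return 0.0, context
```

Calling `run_episode(ToyEnv(), toy_policy)` finishes in two turns with reward 1.0, and the first observation in the returned context has been replaced by the placeholder while both actions are kept.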

System Modules

Dynamic Context Compressor

Reduces memory usage by replacing past HTML observations with short templates while keeping action history

Model or implementation: Rule-based mechanism
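
A rule-based compressor of this kind might look like the sketch below. The turn format (`role`/`content` dictionaries), the `PLACEHOLDER` text, and the choice to keep only the latest observation in full are assumptions made for illustration; the paper's actual templates may differ.

```python
PLACEHOLDER = "[HTML omitted]"  # hypothetical template text


def compress_history(turns: list[dict]) -> list[dict]:
    """Replace every observation except the latest with a short placeholder.

    Each turn is {"role": "observation" | "action", "content": str}.
    Actions are kept verbatim and the input list is not modified.
    """
    last_obs = max(
        (i for i, t in enumerate(turns) if t["role"] == "observation"),
        default=-1,
    )
    compressed = []
    for i, turn in enumerate(turns):
        if turn["role"] == "observation" and i != last_obs:
            compressed.append({"role": "observation", "content": PLACEHOLDER})
        else:
            compressed.append(dict(turn))
    return compressed


def context_chars(turns: list[dict]) -> int:
    """Total characters across all turns, a rough proxy for context length."""
    return sum(len(t["content"]) for t in turns)
```

With this rule the context grows by roughly one placeholder plus one action per turn, instead of by a full HTML page per turn.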

Web Agent Policy

Generates the next web action (and optionally reasoning thoughts) based on compressed history

Model or implementation: Qwen-2.5-3B or Llama-3.1-8B

Novel Architectural Elements

Parallel Trajectory Rollout mechanism: synchronizes multiple browser instances to generate groups of trajectories for M-GRPO updates
Dynamic loss masking tied to context compression: ensures gradients are only calculated on actions despite changing context representations
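
The idea behind the loss masking can be illustrated with a small sketch: only tokens produced by the policy (actions) contribute to the loss, while observation tokens, including compressed placeholders, are masked out. The per-token role labels and helper names are hypothetical; a real implementation would operate on tensors.

```python
def action_mask(roles: list[str]) -> list[int]:
    """1 for tokens generated by the policy, 0 for observation or placeholder tokens."""
    return [1 if role == "action" else 0 for role in roles]


def masked_mean(values: list[float], mask: list[int]) -> float:
    """Average per-token values over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


if __name__ == "__main__":
    roles = ["obs", "obs", "action", "action", "obs", "action"]
    token_loss = [9.0, 9.0, 1.0, 2.0, 9.0, 3.0]
    print(masked_mean(token_loss, action_mask(roles)))
```

Because the mask is rebuilt from the roles of the current context, it stays aligned with the action tokens even after earlier observations have been shortened.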

Modeling

Base Model: Qwen-2.5-3B and Llama-3.1-8B

Training Method: Multi-turn Group Relative Policy Optimization (M-GRPO)

Objective Functions:

Purpose: Optimize policy to maximize binary success reward while staying close to old policy.

Formally: Minimize loss L_M-GRPO using importance sampling ratio r_t(theta) and group relative advantage A_{i,j}, clipped within epsilon.
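
For orientation, a GRPO-style clipped objective consistent with the description above can be written as follows. This is a sketch of the standard form, not a transcription of the paper's equation: the averaging over trajectories, actions, and tokens, the placement of the KL term weighted by beta, and the symbols G, tau_i, R_i, pi_old, and pi_ref are assumptions.

```latex
\mathcal{L}_{\text{M-GRPO}}(\theta)
  = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{j=1}^{|\tau_i|}
    \frac{1}{|a_{i,j}|}\sum_{t=1}^{|a_{i,j}|}
    \Big[\min\big(r_t(\theta)\,A_{i,j},\;
      \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_{i,j}\big)
      - \beta\,D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big]

r_t(\theta) = \frac{\pi_\theta\big(a_{i,j,t}\mid h_{i,j},\,a_{i,j,<t}\big)}
                   {\pi_{\text{old}}\big(a_{i,j,t}\mid h_{i,j},\,a_{i,j,<t}\big)},
\qquad
A_{i,j} = \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)}
```

Here G is the group size, tau_i is the i-th trajectory, a_{i,j} is its j-th action with tokens indexed by t, h_{i,j} is the (compressed) history at that turn, and R_i is the binary task reward shared by every action in trajectory i.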

Adaptation: Full fine-tuning

Training Data:

647 WebArena-Lite tasks used for RL training
165 WebArena-Lite tasks reserved for evaluation

Key Hyperparameters:

clip_epsilon: Not explicitly reported in the paper
beta: Not explicitly reported in the paper
group_size_G: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DigiRL/WebRL: Uses on-policy end-to-end RL (M-GRPO) instead of offline or iterative off-policy methods, avoiding replay buffers and trajectory filtering
vs. o3/GPT-4o: Specialized training on web data via RL allows smaller open models (8B) to outperform massive proprietary models on specific web tasks
vs. DeepSeek-R1: Adapts the GRPO intuition from math reasoning to multi-turn, interactive web environments [not cited in paper as direct baseline, but methodologically related]

Limitations

Relies on rule-based binary rewards, which may be sparse or insufficient for very complex tasks
Evaluation limited to WebArena-Lite and WebVoyager; broader web coverage remains to be tested
Long-CoT reasoning provided smaller gains in RL compared to standard SFT initialization
Context compression is lossy; might discard subtle but important details from past pages

Reproducibility

Code: https://github.com/weizhepei/WebAgent-R1

The repository provides code and artifacts and uses the WebArena environment. Specific hyperparameters (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Web navigation and task completion on live/simulated websites

Benchmarks:

WebArena-Lite: realistic web tasks (Reddit, GitLab, Shopping, etc.)
WebVoyager: out-of-distribution (OOD) web tasks from different domains

Metrics:

Task Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper
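
As a worked example of the metric, success rate is the percentage of evaluation tasks solved. The sketch below assumes one boolean outcome per task; the counts in the usage note (56 and 74 successes out of 165) are inferred here to illustrate the arithmetic and are not reported in the summary.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks solved, rounded to one decimal place."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return round(100.0 * sum(outcomes) / len(outcomes), 1)
```

For instance, 56 successes out of 165 tasks gives 33.9 and 74 out of 165 gives 44.8, so the reported percentages are consistent with whole-task counts on a 165-task evaluation set.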

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on WebArena-Lite showing significant improvement of WebAgent-R1 over baselines and proprietary models.
WebArena-Lite	Success Rate	6.1	33.9	+27.8
WebArena-Lite	Success Rate	8.5	44.8	+36.3
WebArena-Lite	Success Rate	39.4	44.8	+5.4
WebArena-Lite	Success Rate	16.4	44.8	+28.4
Ablation study demonstrating the necessity of Behavior Cloning (BC) initialization.
WebArena-Lite	Success Rate	5.5	33.9	+28.4
Ablation on Long Chain-of-Thought (CoT) integration.
WebArena-Lite	Success Rate	30.3	33.9	+3.6
OOD Evaluation on WebVoyager.
WebVoyager	Success Rate	24.0	40.0	+16.0

Experiment Figures

Training dynamics of WebAgent-R1 showing Reward, Trajectory Length, and Interaction Rounds over training steps.

Effect of scaling test-time interaction turns on success rate.

Main Takeaways

End-to-end on-policy RL (M-GRPO) is highly effective for web agents, surpassing off-policy methods and proprietary models like o3 and GPT-4o.
Behavior Cloning (SFT) is a critical warm-up step; RL without it (WebAgent-R1-Zero) fails completely due to lack of basic web interaction skills.
Thinking-based prompting (CoT) improves performance but constrains exploration during RL; standard SFT allows for more flexible policy improvement during RL.
Test-time scaling via increased interactions (more turns) is a viable strategy for improving web agent success, distinct from simply generating longer responses.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (POMDP, Policy Gradient)
Language Models (SFT, prompting)
Web technologies (HTML structure, DOM)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled trajectories rather than using a learned value function

POMDP: Partially Observable Markov Decision Process—a decision-making framework where the agent cannot see the entire state of the environment (e.g., hidden backend state of a website)

Behavior Cloning (BC): A supervised learning approach where an agent learns to mimic expert actions from a dataset of demonstrations

M-GRPO: Multi-turn Group Relative Policy Optimization—the paper's extension of GRPO to handle sequential decisions over multiple turns in an environment

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

Context Compression: Reducing the length of past inputs (e.g., HTML pages) in the prompt to save memory while retaining essential history

SFT: Supervised Fine-Tuning—training a model on labeled examples (here, expert trajectories) before applying RL

WebArena-Lite: A curated, human-verified subset of the WebArena benchmark, designed for more reliable evaluation of web agents