DynaWeb: Model-Based Reinforcement Learning of Web Agents

📝 Paper Summary

Web Agents, Model-Based Reinforcement Learning (MBRL), World Models

DynaWeb trains web agents by replacing risky live internet interactions with a learned simulator that generates imagined browsing sessions, enabling efficient reinforcement learning without live web access.

Core Problem

Training web agents via online reinforcement learning is inefficient, expensive, and risky because it requires interacting with the live internet, where mistakes (e.g., unintended purchases) are irreversible.

Why it matters:

Real-world web interaction is slow and unstable due to network latency and transient failures
Safety risks like deleting accounts or submitting data make large-scale exploration dangerous
Existing methods either use world models only for planning (inference-time) or merely for generating static offline data, missing the benefits of active policy optimization

Concrete Example: An agent learning to buy a product might accidentally complete a purchase during training. In DynaWeb, the agent practices on a simulated 'dream' version of the site where clicking 'buy' only updates the simulated state without charging a credit card.

Key Novelty

Imagination-driven On-Policy RL with Expert Interleaving

Trains a 'Web World Model' that acts as a simulator: it takes a current web page and an action, then predicts the next page state (accessibility tree) and reasoning trace
Replaces live web interaction with 'imagined rollouts' generated by this model, allowing the agent to learn from trial-and-error in a safe, synthetic environment
Interleaves these imagined trajectories with real expert demonstrations during training to stabilize learning and prevent the agent from exploiting model inaccuracies
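
The imagined-rollout idea above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the paper's implementation: observations and actions are plain strings, and `toy_policy` and `toy_world_model` are hypothetical stand-ins for the LLM-based agent and Web World Model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical simplification: observations and actions are plain strings.
Observation = str
Action = str


@dataclass
class Trajectory:
    steps: List[Tuple[Observation, Action]] = field(default_factory=list)
    source: str = "imagined"  # "imagined" or "expert"


def imagined_rollout(
    policy: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Observation],
    first_obs: Observation,
    max_steps: int = 5,
) -> Trajectory:
    """Roll the policy forward inside the world model; no real site is touched."""
    traj = Trajectory(source="imagined")
    obs = first_obs
    for _ in range(max_steps):
        action = policy(obs)
        traj.steps.append((obs, action))
        if action == "stop":
            break
        # The next observation is predicted, not fetched from the web.
        obs = world_model(obs, action)
    return traj


def toy_policy(obs: Observation) -> Action:
    # Hypothetical rule-based policy standing in for the LLM agent.
    return "stop" if "cart: 1 item" in obs else "click [add_to_cart]"


def toy_world_model(obs: Observation, action: Action) -> Observation:
    # Hypothetical hand-written dynamics standing in for the learned simulator.
    if action == "click [add_to_cart]":
        return obs.replace("cart: 0 items", "cart: 1 item")
    return obs


rollout = imagined_rollout(toy_policy, toy_world_model, "product page | cart: 0 items")
for obs, action in rollout.steps:
    print(f"{obs!r} -> {action}")
```

In this toy run the "add to cart" click only changes the simulated page text, which mirrors the safety argument made in the concrete example earlier.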

Architecture

The DynaWeb framework showing the interaction between the Agent Policy and the Web World Model during training.

Evaluation Highlights

+17.7 percentage points in success rate on WebArena (13.9 → 31.6) using Llama-3-8B-Instruct, compared to the base SFT (Supervised Fine-Tuning) model
+4.3 percentage points in success rate on WebVoyager compared to stronger baselines like WebAgent-R1, despite using significantly less real-world interaction
Demonstrates that a 7B-parameter world model can simulate web dynamics well enough to improve agent policies without live web access

Breakthrough Assessment

8/10

Significant step in making web agent training scalable and safe. Successfully demonstrates that 'dreaming' (MBRL) works for complex web tasks, reducing the dependency on the live internet.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) defined by (S, A, O, T, R) on web interfaces

Inputs: Natural language task query q and initial web page observation o_1 (accessibility tree)

Outputs: Sequence of actions a_t (click, type, scroll) to achieve the task
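
As a rough illustration of the action space named above (click, type, scroll), the sketch below parses a textual action into a typed record. The bracketed syntax (`click [12]`, `type [7] [laptop]`) is an assumed WebArena-like convention, and `BrowserAction` and `parse_action` are hypothetical names; the paper's exact action format is not given in this summary.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Assumed argument counts per action kind (illustrative, not from the paper).
_ARG_COUNTS = {"click": 1, "type": 2, "scroll": 1}


@dataclass(frozen=True)
class BrowserAction:
    kind: str
    args: Tuple[str, ...]


def parse_action(text: str) -> BrowserAction:
    """Parse strings such as 'click [12]' or 'type [7] [laptop]'."""
    match = re.match(r"\s*(\w+)", text)
    if match is None:
        raise ValueError(f"no action keyword in: {text!r}")
    kind = match.group(1).lower()
    if kind not in _ARG_COUNTS:
        raise ValueError(f"unsupported action kind: {kind}")
    # Every bracketed group is treated as one argument.
    args = tuple(re.findall(r"\[([^\]]*)\]", text))
    if len(args) != _ARG_COUNTS[kind]:
        raise ValueError(f"{kind} expects {_ARG_COUNTS[kind]} argument(s), got {len(args)}")
    return BrowserAction(kind, args)


print(parse_action("type [7] [laptop]"))
```

Validating the action before it reaches the environment (real or simulated) keeps malformed model outputs from being silently interpreted as valid steps.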

Pipeline Flow

Observation Processing
Agent Policy Execution
World Model Simulation (Training only)

System Modules

Agent Policy

Generates reasoning trace and browser action based on current observation and history

Model or implementation: Llama-3-8B-Instruct / Qwen2.5-7B-Instruct

Web World Model

Simulates web dynamics by predicting state changes

Model or implementation: Qwen2.5-7B-Instruct / Llama-3.1-8B-Instruct

Novel Architectural Elements

Decoupled State Prediction: The World Model first predicts natural language 'state changes' (Δ) and then applies them to the current state, rather than predicting the full next state text directly
Hybrid Experience Replay: Training buffer mixes purely imagined trajectories (from World Model) with real expert trajectories (from dataset) to stabilize GSPO
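
The decoupled state prediction described above can be sketched as "predict a change, then apply it". The summary says the change is expressed in natural language, so the structured `(operation, element_id, text)` format, the dictionary-based tree, and the function name `apply_changes` are all illustrative assumptions, not the paper's representation.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: an accessibility tree as {element_id: description}.
Tree = Dict[str, str]
# Hypothetical structured change: (operation, element_id, new_text).
Change = Tuple[str, str, str]


def apply_changes(tree: Tree, changes: List[Change]) -> Tree:
    """Return the next tree by applying predicted changes to a copy of the current one."""
    next_tree = dict(tree)  # the current state is left untouched
    for op, elem_id, text in changes:
        if op in ("add", "update"):
            next_tree[elem_id] = text
        elif op == "remove":
            next_tree.pop(elem_id, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return next_tree


current = {"[1]": "button 'Add to cart'", "[2]": "link 'Cart (0)'"}
# In DynaWeb the change would come from the world model; here it is hard-coded.
delta = [("update", "[2]", "link 'Cart (1)'"), ("add", "[3]", "alert 'Item added'")]
next_tree = apply_changes(current, delta)
print(next_tree)
```

One plausible motivation for this design is that most of a page is unchanged by a single action, so predicting only the difference gives the model far less text to get wrong than regenerating the full next state.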

Modeling

Base Model: Llama-3-8B-Instruct and Qwen2.5-7B-Instruct (Agent); Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct (World Model)

Training Method: Group Sequence Policy Optimization (GSPO) on mixed data

Objective Functions:

Purpose: Optimize agent policy using advantage-weighted likelihood of trajectories.

Formally: L(θ) = E[min(r_t(θ)A, clip(r_t(θ), 1-ε, 1+ε)A)] where r_t is the sequence-level likelihood ratio, A is the advantage, and ε is the clipping range.

Purpose: Train World Model to predict state transitions.

Formally: Standard auto-regressive language modeling loss on transition tuples (o_t, a_t, o_{t+1}) from NNetNav.
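
The clipped sequence-level objective given for the agent policy can be worked through numerically. This is a sketch under assumptions: the length-normalised ratio and group-normalised advantage follow the general GSPO formulation rather than anything stated in this summary, and `eps=0.2` is an illustrative value since the paper's clipping range is not reported.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def sequence_ratio(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """Length-normalised sequence likelihood ratio: exp(mean token log-ratio)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalise rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def clipped_objective(
    ratios: Sequence[float], advantages: Sequence[float], eps: float = 0.2
) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over sequences."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)


# Two rollouts for one task: the first succeeded (reward 1), the second failed.
ratios = [
    sequence_ratio([-0.9, -1.8], [-1.0, -2.0]),  # policy now likes this sequence more
    sequence_ratio([-1.2, -2.1], [-1.0, -2.0]),  # and this one less
]
advantages = group_advantages([1.0, 0.0])
print(clipped_objective(ratios, advantages))
```

Because the ratio is computed once per sequence, a whole trajectory is either clipped or not, in contrast to token-level clipping where individual tokens of the same response can be treated differently.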

Training Data:

World Model Data: Cleaned trajectories from StanfordNLP/NNetNav dataset
Agent RL Data: Mixture of on-policy imagined rollouts and offline expert trajectories
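
The mixture of imagined and expert trajectories can be sketched as a simple batch builder. The function name `build_mixed_batch` and the `expert_fraction` parameter are hypothetical; the summary does not state the mixing ratio or the sampling scheme the authors use.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical trajectory record: (source, trajectory_id).
Traj = Tuple[str, int]


def build_mixed_batch(
    imagined: Sequence[Traj],
    expert: Sequence[Traj],
    batch_size: int,
    expert_fraction: float,
    rng: random.Random,
) -> List[Traj]:
    """Sample a batch with roughly `expert_fraction` expert trajectories."""
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must lie in [0, 1]")
    n_expert = min(len(expert), round(batch_size * expert_fraction))
    n_imagined = min(len(imagined), batch_size - n_expert)
    batch = rng.sample(list(expert), n_expert) + rng.sample(list(imagined), n_imagined)
    rng.shuffle(batch)  # avoid a fixed expert-then-imagined ordering
    return batch


imagined_pool = [("imagined", i) for i in range(10)]
expert_pool = [("expert", i) for i in range(5)]
batch = build_mixed_batch(imagined_pool, expert_pool, 8, 0.25, random.Random(0))
print(batch)
```

Keeping a fixed share of real expert data in every batch is one simple way to anchor the policy to behaviour that is known to work on real pages, which limits how far it can drift toward exploiting world-model errors.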

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
clip_epsilon: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: GPT-4o-mini is prompted to filter the World Model training data; specific GPU hours are not reported

Comparison to Prior Work

vs. WebAgent-R1: DynaWeb uses a learned world model for training (offline/imagined) instead of live environment interaction
vs. WebDreamer: DynaWeb uses the world model for *learning* the policy (MBRL), whereas WebDreamer uses it for *planning* at test time
vs. Dreamer (MBRL general) [not cited in paper]: DynaWeb applies Dreamer-like concepts specifically to text-based web agents with a novel state-change prediction architecture

Limitations

World model accuracy degrades over long horizons, potentially leading to hallucinated states
Requires an initial dataset of real interactions (NNetNav) to train the world model
Text-based accessibility tree representation may miss visual cues present in pixel-based agents
No direct cost comparison (e.g., dollar amount saved) provided against online RL methods

Reproducibility

Code and data are not yet released ('Coming-Soon' in abstract). World model is trained on public NNetNav dataset. Hyperparameters for GSPO are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluation on web navigation tasks (self-hosted and live websites), measuring task completion success rate

Benchmarks:

WebArena: realistic web tasks (shopping, Reddit, CMS) on self-hosted sites
WebVoyager: live web tasks across 15 websites (booking, searching, etc.)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

DynaWeb consistently improves success rates over SFT baselines on both WebArena and WebVoyager benchmarks.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
WebArena	Success Rate	13.9	31.6	+17.7
WebArena	Success Rate	19.3	32.1	+12.8
WebVoyager	Success Rate	43.3	54.6	+11.3
WebVoyager	Success Rate	50.3	54.6	+4.3
WebArena	Success Rate	24.5	31.6	+7.1
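
The Δ column can be checked with a few lines of arithmetic. The snippet below recomputes each absolute gain from the table values and also derives the relative gain, which the table does not report; the `summarize` helper is just an illustrative name.

```python
from typing import List, Tuple

# (benchmark, baseline success rate, DynaWeb success rate), copied from the table.
rows = [
    ("WebArena", 13.9, 31.6),
    ("WebArena", 19.3, 32.1),
    ("WebVoyager", 43.3, 54.6),
    ("WebVoyager", 50.3, 54.6),
    ("WebArena", 24.5, 31.6),
]


def summarize(table: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Return (benchmark, absolute gain in points, relative gain in percent)."""
    out = []
    for bench, base, ours in table:
        absolute = round(ours - base, 1)
        relative = round(100.0 * (ours - base) / base, 1)
        out.append((bench, absolute, relative))
    return out


for bench, absolute, relative in summarize(rows):
    print(f"{bench}: +{absolute} points ({relative}% relative)")
```

The distinction matters when reading the highlights: the gains are absolute differences in success rate (percentage points), so the first WebArena row more than doubles the baseline even though the Δ reads 17.7.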

Experiment Figures

Success rate comparison on WebArena and WebVoyager for various methods.

Main Takeaways

Model-based training works for web agents: Imagined rollouts provide a sufficiently accurate signal to improve policies significantly over SFT.
Hybrid data is crucial: Interleaving real expert trajectories with imagined ones prevents model collapse and stabilizes training (the ablation shows a gain of 7.1 percentage points).
Generalization: DynaWeb shows consistent gains across different base LLMs (Llama-3, Qwen2.5) and benchmarks (WebArena, WebVoyager).
Efficiency: Achieves performance competitive with or better than online RL methods (WebAgent-R1) without the risks and costs of live interaction.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients, Importance Sampling)
Large Language Models (LLMs) for reasoning and simulation
Web automation (Accessibility trees, DOM interaction)

Key Terms


MBRL: Model-Based Reinforcement Learning—an RL approach where the agent learns a model of the environment's dynamics and uses it to simulate experience

World Model: A learned simulator that predicts the next state of the environment given the current state and an action

Imagined Rollout: A trajectory of states and actions generated entirely by interacting with the world model rather than the real environment

Accessibility Tree: A structured, text-based representation of a web page's user interface elements, used as the observation space for the agent

GSPO: Group Sequence Policy Optimization—an RL algorithm that updates the policy based on the likelihood of entire trajectories (sequences) rather than individual tokens

SFT: Supervised Fine-Tuning—training a model on a dataset of expert demonstrations before applying reinforcement learning

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before producing the final answer or action

WebArena: A benchmark for evaluating web agents on realistic, self-hosted websites

WebVoyager: A benchmark for evaluating web agents on live websites via multimodal interaction