Agent Learning via Early Experience

📝 Paper Summary

Agentic AI, Imitation Learning, Reward-free Learning

Early Experience trains agents using the future states resulting from their own exploratory actions as supervision, enabling improvement without external rewards or additional human data.

Core Problem

Training language agents is difficult because many environments lack verifiable rewards for Reinforcement Learning (RL), while Supervised Fine-Tuning (SFT) on expert data fails to teach agents how to recover from errors or handle unseen states.

Why it matters:

Scaling high-quality human demonstrations is expensive and captures only a narrow range of scenarios.
Current SFT agents are passive; they never observe the consequences of non-expert actions, making them brittle to distribution shifts.
Many real-world tasks (e.g., open-ended web navigation) lack the reliable reward signals required for traditional RL.

Concrete Example: In WebShop, an agent trained only on successful purchases might not know what to do if it accidentally clicks a wrong button. Without early experience, it never sees the resulting error page or state change, so it cannot learn to correct its course.

Key Novelty

Early Experience Paradigm

Treats the agent's own interaction traces (actions and resulting future states) as direct supervision signals without needing external rewards.
Implicit World Modeling: Trains the policy to predict the next state given a state-action pair, forcing the agent to internalize environment dynamics.
Self-Reflection: Uses an LLM to generate 'internal monologues' explaining why an expert action is better than the agent's own sampled alternative, based on the observed outcomes of both.

Evaluation Highlights

Implicit World Modeling improves WebShop success rate by 18.4 percentage points over imitation learning (Llama-3.2-3B).
Self-Reflection yields a 15.0 percentage-point success rate gain on TravelPlanner (Llama-3.1-8B) by improving long-horizon reasoning.
Checkpoints initialized with Early Experience achieve higher post-RL performance ceilings than imitation learning starts when rewards are available (e.g., +4.4% on ALFWorld with GRPO).

Breakthrough Assessment

8/10

Strong conceptual bridge between imitation and RL. Demonstrates that reward-free exploration can significantly boost performance across diverse benchmarks, addressing the bottleneck of missing or unverifiable rewards in agent training.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) M = (S, A, T, R, γ, ρ_0), where the reward R is unavailable or unverifiable during the early experience phase.

Inputs: Current state s (textual description, HTML, tool output) and expert trajectory dataset D_expert.

Outputs: Next action a (tool call, navigation command, or text generation).

Pipeline Flow

Data Collection: Sample alternative actions from policy at expert states
Environment Interaction: Execute alternative actions to get next states (rollouts)
Label Generation: Construct supervision (next-state prediction or reflection generation)
Training: Fine-tune model on combined expert and rollout data
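
The four pipeline steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToyEnv`, `sample_alternatives`, and `collect_early_experience` are hypothetical names, the environment is a toy counter, and a random choice stands in for sampling from the current policy. Excluding the expert action from the alternatives is also an assumption of this sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    state: str
    action: str
    next_state: str
    is_expert: bool


class ToyEnv:
    """Hypothetical deterministic environment: the state is an integer string."""

    ACTIONS = ("inc", "dec", "noop")

    def step(self, state: str, action: str) -> str:
        value = int(state)
        if action == "inc":
            value += 1
        elif action == "dec":
            value -= 1
        return str(value)


def sample_alternatives(state, expert_action, k, rng):
    """Stand-in for sampling K actions from the current policy (assumption)."""
    candidates = [a for a in ToyEnv.ACTIONS if a != expert_action]
    return rng.sample(candidates, k=min(k, len(candidates)))


def collect_early_experience(expert_pairs, env, k, seed=0):
    """Steps 1-2: branch off each expert state and record the resulting next states."""
    rng = random.Random(seed)
    rollouts = []
    for state, expert_action in expert_pairs:
        # Keep the expert transition alongside the alternatives.
        rollouts.append(Rollout(state, expert_action, env.step(state, expert_action), True))
        for action in sample_alternatives(state, expert_action, k, rng):
            rollouts.append(Rollout(state, action, env.step(state, action), False))
    return rollouts


expert_pairs = [("0", "inc"), ("1", "inc")]
rollouts = collect_early_experience(expert_pairs, ToyEnv(), k=2)
for r in rollouts:
    print(r)
```

No reward is read at any point: the only signal recorded is the next state, which is what steps 3 and 4 turn into supervision.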

System Modules

Rollout Generator

Generate alternative experiences by sampling actions from the current policy at states visited by experts.

Model or implementation: Base LLM (e.g., Llama-3.1-8B)

Implicit World Modeler (IWM) (Training Objective)

Train policy to predict dynamics. Uses next-token prediction on (state, action) -> next_state.

Model or implementation: Same parameters as Policy
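
As a rough sketch of how (state, action) -> next_state supervision could be serialized for next-token prediction: the prompt wording, the `build_iwm_example` helper, and the WebShop-like strings below are illustrative assumptions, not the paper's templates.

```python
def build_iwm_example(state: str, action: str, next_state: str) -> dict:
    """Format one rollout triple as a prompt/target pair (hypothetical template)."""
    prompt = (
        "Predict the observation that follows the action.\n"
        f"State: {state}\n"
        f"Action: {action}\n"
        "Next state:"
    )
    # Only the target would contribute to the loss; the prompt is context.
    return {"prompt": prompt, "target": " " + next_state}


# Made-up triples: one expert-like action and one alternative at the same state.
triples = [
    ("search results for 'red mug'", "click[item 3]", "product page: red ceramic mug"),
    ("search results for 'red mug'", "click[back]", "search box (empty)"),
]
iwm_examples = [build_iwm_example(*t) for t in triples]
print(iwm_examples[1]["prompt"])
print(iwm_examples[1]["target"])
```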

Self-Reflector (SR) (Training Objective)

Train policy to reason about sub-optimality. Uses LLM to generate rationale comparing expert vs. alternative outcomes.

Model or implementation: Same parameters as Policy
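
One way to picture the Self-Reflection data construction is below. This is a sketch under assumptions: the prompt text, the helper names, and `stub_llm` (a fixed-string stand-in for a real LLM call) are all hypothetical. What it takes from the description above is only the shape: the rationale is generated from the observed outcomes of both actions, and the training target is the rationale followed by the expert action.

```python
from typing import Callable


def build_reflection_prompt(state, expert_action, expert_next, alt_action, alt_next) -> str:
    """Prompt that grounds the comparison in observed outcomes (hypothetical wording)."""
    return (
        f"State: {state}\n"
        f"Expert action: {expert_action} -> outcome: {expert_next}\n"
        f"Alternative action: {alt_action} -> outcome: {alt_next}\n"
        "Explain why the expert action is the better choice."
    )


def build_sr_example(state, expert_action, expert_next, alt_action, alt_next,
                     generate: Callable[[str], str]) -> dict:
    """Create one training pair: target = rationale followed by the expert action."""
    rationale = generate(
        build_reflection_prompt(state, expert_action, expert_next, alt_action, alt_next)
    )
    return {
        "prompt": f"State: {state}\nThink, then act.",
        "target": f"{rationale}\nAction: {expert_action}",
    }


def stub_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned rationale."""
    return "The alternative leaves the budget filter unset, so the expert action is preferable."


sr_example = build_sr_example(
    state="flight search form",
    expert_action="set_filter[max_price=300]",
    expert_next="results within budget",
    alt_action="click[search]",
    alt_next="results include flights over budget",
    generate=stub_llm,
)
print(sr_example["target"])
```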

Novel Architectural Elements

Dual-use of policy network: The same LLM parameters are used for decision making (policy) and dynamics prediction (implicit world model) via instruction tuning.
Outcome-grounded reflection: Reflection data is generated based on actual execution outcomes (future states) rather than just static text analysis.

Modeling

Base Model: Llama-3.2-3B, Qwen-2.5-7B, Llama-3.1-8B (Instruction Tuned versions)

Training Method: Supervised Fine-Tuning (Next-Token Prediction)

Objective Functions:

Purpose: Implicit World Modeling.
Formally: L_IWM = - sum log p(s_next | s, a) over rollout data.

Purpose: Self-Reflection.
Formally: L_SR = - sum log p(explanation, a_expert | s) over reflection data.

Purpose: Imitation Learning (Base).
Formally: L_IL = - sum log p(a_expert | s) over expert data.
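
All three objectives share one form: a summed negative log-likelihood over target tokens, differing only in which dataset and which target they use. The worked example below illustrates that shared form with made-up per-token probabilities; `sequence_nll` and `objective` are hypothetical helper names, and no real model is involved.

```python
import math


def sequence_nll(target_token_probs):
    """Negative log-likelihood of one target sequence.

    Each entry is the probability the model assigns to the correct
    target token; prompt tokens are assumed to be excluded already.
    """
    return -sum(math.log(p) for p in target_token_probs)


def objective(dataset):
    """Sum of sequence NLLs over a dataset, matching the '- sum log p(...)' form."""
    return sum(sequence_nll(probs) for probs in dataset)


# Made-up probabilities for illustration only.
rollout_data = [[0.5, 0.25], [0.8]]  # targets are next-state tokens (L_IWM)
expert_data = [[0.9, 0.9]]           # targets are expert-action tokens (L_IL)

loss_iwm = objective(rollout_data)   # -log(0.5 * 0.25 * 0.8) = log(10)
loss_il = objective(expert_data)     # -log(0.9 * 0.9)

print(f"L_IWM = {loss_iwm:.4f}")
print(f"L_IL  = {loss_il:.4f}")
```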

Adaptation: Full fine-tuning, except for 70B models, which use LoRA

Trainable Parameters: Full model for models up to 8B; LoRA adapters for 70B

Training Data:

Expert trajectories from benchmarks (e.g., 21k pairs for ALFWorld, 15k for WebShop).
Rollout data generated by sampling K alternative actions per expert state.

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
epochs: Variable (matched to imitation learning budget per task)

Compute: Up to 8 H100 GPUs

Comparison to Prior Work

vs. STaR: Early Experience uses negative/alternative outcomes for supervision, whereas STaR only learns from successful self-generated traces.
vs. Long CoT: Early Experience fine-tunes the model to internalize lessons from exploration, avoiding the drift/hallucination common in pure inference-time prompting.
vs. World Models: Implicitly models dynamics within the policy weights via auxiliary tasks rather than training a separate simulator module for planning.
vs. PPO/RL [not cited in paper]: Does not require a reward function, making it applicable in reward-free settings unlike standard RL.

Limitations

Depends on the availability of seed expert trajectories to start exploration.
Self-reflection quality depends on the LLM's capability to reason about outcomes.
Computationally more expensive than simple SFT due to rollout generation.
Performance gains vary across environments (smaller gains on open-ended web tasks like WebArena).

Reproducibility

Code is not yet released (the paper promises code, but no URL is provided). Missing: exact learning rates, batch sizes, and prompt templates for all baselines. Expert trajectory counts are provided.

📊 Experiments & Results

Evaluation Setup

8 diverse environments spanning embodied agents, web navigation, and tool use.

Benchmarks:

ALFWorld (Embodied instruction following)
WebShop (E-commerce web navigation)
ScienceWorld (Scientific experiment simulation)
TravelPlanner (Long-horizon planning)
BFCLv3 (Multi-turn tool use)
Tau-Bench (Customer service API interaction)
SearchQA (Multi-hop QA with search)
WebArena-Lite (Open web navigation)

Metrics:

Success Rate (SR)
F1 Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Effectiveness: Early Experience methods (IWM and SR) consistently outperform Imitation Learning across diverse benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
WebShop	Success Rate	41.8	60.2	+18.4
TravelPlanner	Success Rate	17.2	32.2	+15.0
ScienceWorld	Success Rate	54.7	68.0	+13.3
ALFWorld	Success Rate	78.1	85.9	+7.8

Out-of-domain (OOD) generalization: models trained with Early Experience adapt better to unseen scenarios than imitation baselines.

Benchmark	Metric	Baseline	This Paper	Δ
SearchQA (OOD)	F1	40.5	45.4	+4.9
BFCLv3 (OOD)	Success Rate	5.3	13.8	+8.5

RL warm-up: Early Experience checkpoints serve as better initializations for GRPO than standard Imitation Learning checkpoints.

Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	91.4	99.2	+7.8
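
The Δ column is an absolute difference in percentage points (This Paper minus Baseline), not a relative improvement. The short check below recomputes each Δ from the Key Results rows and shows the relative gain for comparison. The numbers are copied from the rows above; the row labels in parentheses and the helper names are added here for illustration.

```python
# (label, baseline, this_paper, reported_delta), copied from the Key Results rows.
key_results = [
    ("WebShop", 41.8, 60.2, 18.4),
    ("TravelPlanner", 17.2, 32.2, 15.0),
    ("ScienceWorld", 54.7, 68.0, 13.3),
    ("ALFWorld", 78.1, 85.9, 7.8),
    ("SearchQA (OOD)", 40.5, 45.4, 4.9),
    ("BFCLv3 (OOD)", 5.3, 13.8, 8.5),
    ("ALFWorld (RL warm-up)", 91.4, 99.2, 7.8),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute gain in percentage points, rounded like the table."""
    return round(ours - baseline, 1)


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative gain as a percentage of the baseline."""
    return 100.0 * (ours - baseline) / baseline


# Rows whose recomputed delta disagrees with the reported one.
mismatches = [row[0] for row in key_results if delta_points(row[1], row[2]) != row[3]]
print("rows with inconsistent deltas:", mismatches)

for label, baseline, ours, _ in key_results:
    print(f"{label}: +{delta_points(baseline, ours)} points, "
          f"{relative_gain_pct(baseline, ours):.1f}% relative")
```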

Main Takeaways

IWM (Implicit World Modeling) excels in environments with stable dynamics (e.g., WebShop), helping agents internalize transition rules.
SR (Self-Reflection) dominates in tasks requiring complex reasoning and long-horizon planning (e.g., TravelPlanner), where logic errors are more common than dynamics errors.
Early Experience reduces data requirements: WebShop agents match full-data imitation performance using only one-eighth of the expert data.
Scaling benefits: Gains persist across model sizes (3B to 70B), with Early Experience consistently shifting the scaling curve upwards.

📚 Prerequisite Knowledge

Prerequisites

Imitation Learning / Behavior Cloning
Reinforcement Learning basics (MDPs, exploration)
Language Model Fine-tuning (Next-token prediction)

Key Terms

Early Experience: A training paradigm where agents learn from the future states generated by their own actions, using them as supervision without external rewards.

Implicit World Modeling: Training the policy to predict the next state (token sequence) given a current state and action, helping it internalize environment dynamics.

Self-Reflection: A method where the agent compares its own sampled action to an expert action, using the observed outcomes to generate a natural language explanation of why the expert choice was better.

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used here to validate that early experience provides a better starting point for RL.

SFT: Supervised Fine-Tuning—training a model on expert demonstrations (also called Imitation Learning).

Rollout: A sequence of interactions generated by the agent acting in the environment.

DOM: Document Object Model—the structural representation of a webpage used in web navigation tasks.

Chain-of-Thought: Intermediate reasoning steps generated by the model before producing the final action.