
Agent Learning via Early Experience

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
Meta Superintelligence Labs, Fundamental AI Research at Meta, The Ohio State University
arXiv (2025)
Agent RL Reasoning Benchmark

📝 Paper Summary

Agentic AI, Imitation Learning, Reward-free Learning
Early Experience trains agents using the future states resulting from their own exploratory actions as supervision, enabling improvement without external rewards or additional human data.
Core Problem
Training language agents is difficult because many environments lack verifiable rewards for Reinforcement Learning (RL), while Supervised Fine-Tuning (SFT) on expert data fails to teach agents how to recover from errors or handle unseen states.
Why it matters:
  • Scaling high-quality human demonstrations is expensive and captures only a narrow range of scenarios.
  • Current SFT agents are passive; they never observe the consequences of non-expert actions, making them brittle to distribution shifts.
  • Many real-world tasks (e.g., open-ended web navigation) lack the reliable reward signals required for traditional RL.
Concrete Example: In WebShop, an agent trained only on successful purchases might not know what to do if it accidentally clicks a wrong button. Without early experience, it never sees the resulting error page or state change, so it cannot learn to correct its course.
Key Novelty
Early Experience Paradigm
  • Treats the agent's own interaction traces (actions and resulting future states) as direct supervision signals without needing external rewards.
  • Implicit World Modeling: Trains the policy to predict the next state given a state-action pair, forcing the agent to internalize environment dynamics.
  • Self-Reflection: Uses an LLM to generate 'internal monologues' explaining why an expert action is better than the agent's own sampled alternative, based on the observed outcomes of both.
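The two strategies above can be sketched as data-construction steps over the agent's own rollouts. This is a minimal illustration, not the authors' implementation: the `Rollout` record, the `propose` and `step` callables, the prompt wording, and the toy WebShop-like environment are all hypothetical names introduced here. The summary does not say how the generated reflections are used in training, so the sketch stops at building the prompts an LLM would answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    state: str        # observation the action was taken from
    action: str       # action that was executed
    next_state: str   # observation the environment returned
    is_expert: bool   # True if the action came from the demonstration


def collect_rollouts(
    state: str,
    expert_action: str,
    propose: Callable[[str, int], List[str]],  # hypothetical policy sampler
    step: Callable[[str, str], str],           # hypothetical environment transition
    k: int = 3,
) -> List[Rollout]:
    """Run the expert action and the policy's own alternatives from one state."""
    rollouts = [Rollout(state, expert_action, step(state, expert_action), True)]
    for action in propose(state, k):
        if action != expert_action:  # keep only genuinely different actions
            rollouts.append(Rollout(state, action, step(state, action), False))
    return rollouts


def world_model_examples(rollouts: List[Rollout]) -> List[Tuple[str, str]]:
    """Implicit world modeling: (state, action) prompt -> next-state target."""
    return [
        (f"State: {r.state}\nAction: {r.action}\nNext state:", r.next_state)
        for r in rollouts
    ]


def reflection_prompts(rollouts: List[Rollout]) -> List[str]:
    """Self-reflection: ask an LLM to contrast the expert and alternative outcomes."""
    expert = next(r for r in rollouts if r.is_expert)
    return [
        (
            f"State: {expert.state}\n"
            f"Expert action: {expert.action} -> {expert.next_state}\n"
            f"Alternative action: {r.action} -> {r.next_state}\n"
            "Explain why the expert action is the better choice."
        )
        for r in rollouts
        if not r.is_expert
    ]


# Toy stand-ins for a real environment and policy (illustrative only).
def toy_step(state: str, action: str) -> str:
    if action == "click[Buy Now]":
        return "order confirmation page"
    return f"error page after {action}"


def toy_propose(state: str, k: int) -> List[str]:
    return ["click[Back]", "click[Buy Now]", "click[Help]"][:k]


rollouts = collect_rollouts("product page", "click[Buy Now]", toy_propose, toy_step)
for prompt, target in world_model_examples(rollouts):
    print(repr(prompt), "->", repr(target))
for prompt in reflection_prompts(rollouts):
    print(prompt, end="\n\n")
```

No reward appears anywhere in the sketch: the only supervision is the next state the environment returns, which is what lets the same rollouts feed both the next-state prediction targets and the reflection prompts.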
Evaluation Highlights
  • Implicit World Modeling improves success rate on WebShop (Llama-3.2-3B) by +18.4% over imitation learning.
  • Self-Reflection yields a +15.0% success-rate gain on TravelPlanner (Llama-3.1-8B) by improving long-horizon reasoning.
  • When rewards are available, RL runs initialized from Early Experience checkpoints reach higher final performance than runs initialized from imitation-learning checkpoints (e.g., +4.4% on ALFWorld with GRPO).
Breakthrough Assessment
8/10
Strong conceptual bridge between imitation learning and RL. Demonstrates that reward-free exploration can significantly boost performance across diverse benchmarks, directly addressing the lack of reliable reward signals that limits agent training.