
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, S. Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
The AGI Company (MultiOn), Stanford University
arXiv.org (2024)
Agent Reasoning RL Benchmark

📝 Paper Summary

Self-evolving Agentic reasoning RL-based
Agent Q improves autonomous web agents by combining Monte Carlo Tree Search with self-critique to generate data, then learning from both successful and failed trajectories using offline preference optimization.
Core Problem
LLM agents trained via behavior cloning on expert demonstrations suffer from compounding errors and fail to generalize because they lack exposure to negative examples and exploration data.
Why it matters:
  • Current agents are unreliable in long-horizon tasks (e.g., booking flights) because a single error often leads to total task failure.
  • Supervised fine-tuning on expert data does not teach agents how to recover from mistakes or explore alternative paths.
  • Online reinforcement learning (like PPO) is dangerous and costly in real-world web environments (e.g., accidentally completing a purchase).
Concrete Example: In a reservation task, a standard agent might click the wrong date and continue futilely toward a booking that can no longer succeed. Agent Q uses search to explore alternative actions, detects the date error via self-critique, backtracks, and learns 'don't click X when Y is desired' from this negative experience.
Key Novelty
Guided Search + DPO for Agents (Agent Q)
  • Uses Monte Carlo Tree Search (MCTS) guided by AI self-critique to explore the web environment and generate a tree of diverse trajectories (both successful and failed).
  • Constructs preference pairs from these search trees (ranking successful branches over failed ones) to train the model offline using Direct Preference Optimization (DPO).
  • The resulting model internalizes the 'search' capability, allowing it to predict better long-horizon actions zero-shot or perform even better with online search.
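The pair-construction and DPO steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` structure, the `preference_pairs` helper, and the `margin` threshold are hypothetical names introduced here, and the node values stand in for whatever combination of search statistics and self-critique scores the real system uses. The `dpo_loss` function is the standard per-pair DPO objective, computed on scalar log-probabilities for clarity.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    """One candidate action from a web-page state (hypothetical structure)."""
    action: str
    value: float  # assumed success estimate in [0, 1] produced by the search


def preference_pairs(state, children, margin=0.2):
    """Build (state, preferred, rejected) triples from sibling actions.

    Only siblings whose values differ by at least `margin` are paired;
    the threshold is an assumption made for this sketch.
    """
    pairs = []
    for a, b in combinations(children, 2):
        if abs(a.value - b.value) >= margin:
            win, lose = (a, b) if a.value > b.value else (b, a)
            pairs.append((state, win.action, lose.action))
    return pairs


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference policy
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Usage: three candidate actions explored at a date-picker state.
children = [
    Node("click 'May 12'", 0.9),   # leads to successful bookings
    Node("click 'May 21'", 0.1),   # wrong date, leads to failure
    Node("scroll calendar", 0.8),
]
for state, preferred, rejected in preference_pairs("date picker", children):
    print(f"{state}: prefer {preferred!r} over {rejected!r}")
```

Pairing sibling actions at the same state, rather than whole trajectories, gives the model a step-level signal about which branch to take, which is how a failed branch becomes useful training data.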
Evaluation Highlights
  • Boosts Llama-3 70B zero-shot success rate from 18.6% to 81.7% (+340% relative) on real-world booking scenarios after one day of data collection.
  • Achieves 95.4% success rate on real-world booking tasks when the trained Agent Q model is equipped with online MCTS.
  • Outperforms average human performance on the WebShop benchmark when equipped with online search.
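As a quick check on the relative-improvement figure quoted above, the calculation can be reproduced directly from the two reported success rates. The helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Reported zero-shot success rates for Llama-3 70B on the booking task.
gain = relative_gain(18.6, 81.7)
print(f"{gain:.1f}%")  # about 339%, which the summary rounds to +340%
```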
Breakthrough Assessment
9/10
Significant leap in real-world agent reliability. Effectively solves the 'data scarcity' problem for agents by synthesizing useful data (successes AND failures) via search, then distilling it.