Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

📝 Paper Summary

Multi-agent systems, Agentic workflow orchestration, Generalist agents

Magentic-One is a multi-agent system where a central Orchestrator dynamically plans, tracks progress via structured ledgers, and routes subtasks to specialized agents (Web, File, Code) to solve complex, open-ended problems.

Core Problem

Existing agentic systems often lack the generality to handle diverse, multi-step tasks that require planning, error recovery, and dynamic tool usage across both web and local file environments.

Why it matters:

Monolithic single-agent approaches struggle with complex, long-horizon tasks requiring distinct skills (e.g., coding vs. browsing)
Rigid workflows cannot adapt to novel errors or changing environments, limiting real-world utility
Evaluation of agentic systems is difficult due to side-effects and stochasticity, requiring rigorous containment and repetition controls

Concrete Example: A user asks for a survey and slide deck of recent AI safety papers. A single agent might fail to navigate the web, download PDFs, read them, and write code to generate slides, all in one context. Magentic-One splits this: WebSurfer finds papers, FileSurfer reads them, Coder writes the slide-generation script, and ComputerTerminal executes it.

Key Novelty

Ledger-based Orchestrator for Multi-Agent Dynamic Routing

Uses a central Orchestrator that maintains two structured ledgers (Task Ledger for overall plan/facts, Progress Ledger for immediate history) to manage short-term memory and planning
Implements a dual-loop workflow: an outer loop for high-level replanning/reflection and an inner loop for step-by-step instruction of specialized agents
Modular design allows adding or removing agents (e.g., WebSurfer, Coder) without altering the core Orchestrator logic or requiring prompt tuning

Architecture

The Magentic-One architecture workflow, illustrating the Orchestrator's interaction with the Task/Progress Ledgers and the specialized agents.

Evaluation Highlights

Achieves a 38% completion rate on the GAIA benchmark (validation set), statistically competitive with the state of the art
Achieves a 32.8% completion rate on WebArena, performing competitively against specialized web-only agents
Attains 27.7% accuracy on AssistantBench, demonstrating capability in realistic user-assistant tasks

Breakthrough Assessment

8/10

Strong empirical results across diverse benchmarks (WebArena, GAIA) using a unified, generalist architecture. The ledger-based orchestration offers a clean, extensible paradigm for multi-agent coordination.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where a system interacts with a computer environment (web, files, code execution) to satisfy a natural language request

Inputs: Task description and optional file attachments

Outputs: Textual answer and/or a specific environmental state (e.g., a generated file)

Pipeline Flow

Orchestrator (Outer Loop): Initialize/Update Task Ledger (Plan, Facts, Guesses)
Orchestrator (Inner Loop): Check Progress Ledger → Select Next Agent → Issue Instruction
Specialized Agents: Execute Instruction (Web, File, Code) → Return Observation
Orchestrator: Update Progress Ledger; if stuck/looping, return to Outer Loop for replanning
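
The four steps above can be sketched as a pair of nested loops. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_plan` and `decide` are hypothetical stand-ins for the Orchestrator's LLM calls, agents are modeled as plain callables, and the limits `max_stalls`, `max_replans`, and `max_steps` are assumed values.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: an agent maps an instruction to an observation string.
Agent = Callable[[str], str]
# Hypothetical stand-in for the LLM call that fills in the Progress Ledger.
# Returns (task_done, making_progress, next_agent_name, instruction).
Decide = Callable[[str, List[str]], Tuple[bool, bool, str, str]]
# Hypothetical stand-in for the LLM call that writes or revises the plan.
MakePlan = Callable[[str, List[str]], str]


def run_orchestrator(
    task: str,
    agents: Dict[str, Agent],
    make_plan: MakePlan,
    decide: Decide,
    max_stalls: int = 2,    # assumed threshold, not taken from the paper
    max_replans: int = 3,   # assumed limit
    max_steps: int = 20,    # assumed limit per plan
) -> List[str]:
    """Run the outer (replanning) and inner (step-by-step) loops."""
    history: List[str] = []
    for _ in range(max_replans + 1):
        # Outer loop: create or update the plan from what is known so far.
        plan = make_plan(task, history)
        stalls = 0
        for _ in range(max_steps):
            # Inner loop: check progress, pick an agent, issue an instruction.
            done, progressing, name, instruction = decide(plan, history)
            if done:
                return history
            stalls = 0 if progressing else stalls + 1
            if stalls > max_stalls:
                break  # stuck or looping: return to the outer loop
            observation = agents[name](instruction)
            history.append(f"{name}: {observation}")
    return history
```

Because the agents are looked up by name in a dictionary, adding or removing one changes the roster but not the loop itself, which mirrors the modularity claim made earlier in the summary.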

System Modules

Orchestrator

Central controller that plans tasks, delegates to workers, tracks progress via ledgers, and recovers from stalls

Model or implementation: GPT-4o

WebSurfer (Action Execution)

Performs web navigation and extraction

Model or implementation: GPT-4o

FileSurfer (Action Execution)

Navigates local files and previews content

Model or implementation: GPT-4o

Coder (Action Execution)

Writes Python code for data analysis or artifact creation

Model or implementation: GPT-4o

ComputerTerminal (Action Execution)

Executes code and shell commands

Model or implementation: Deterministic (No LLM)
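
A deterministic executor of this kind needs no model call: it runs what it is given and reports what happened. The sketch below is one way such a component could look, using only the Python standard library. The function name `run_python` and the returned dictionary layout are assumptions for illustration. A child process with a timeout is not a security sandbox, so real use would need stronger containment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a child process and capture its output.

    Hypothetical helper: the name and return format are illustrative only.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep generated files in a scratch directory
                capture_output=True,  # collect stdout and stderr for the caller
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"exit_code": None, "stdout": "", "stderr": "timed out"}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

The captured `stdout`, `stderr`, and exit code together play the role of the observation that is handed back to the Orchestrator.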

Novel Architectural Elements

Dual-ledger memory system (Task Ledger vs. Progress Ledger) explicitly separating high-level planning from low-level execution history
Nested-loop orchestration: Inner loop for action execution, Outer loop for stall detection and reflective replanning
Dynamic team management: Orchestrator selects agents from a roster based on current context rather than a fixed chain
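
To make the split between the two ledgers concrete, here is a minimal sketch of how they might be represented. The field names follow the descriptions in this summary (plan, facts, guesses; history, stall tracking, next instruction), but the classes, method names, and prompt layout are assumptions, not the data model used in the Magentic-One code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskLedger:
    """Long-lived memory: the overall plan plus what is known or guessed."""
    plan: List[str] = field(default_factory=list)
    facts: List[str] = field(default_factory=list)
    guesses: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Hypothetical text layout for feeding the ledger back to a model.
        sections = [("PLAN", self.plan), ("FACTS", self.facts), ("GUESSES", self.guesses)]
        return "\n".join(
            f"{title}:\n" + "\n".join(f"- {item}" for item in items)
            for title, items in sections
        )


@dataclass
class ProgressLedger:
    """Short-lived memory: recent steps, stall tracking, and the next move."""
    history: List[str] = field(default_factory=list)
    stall_count: int = 0
    next_agent: str = ""
    next_instruction: str = ""

    def record(self, agent: str, observation: str, made_progress: bool) -> None:
        # Append the step and update the stall counter used for loop detection.
        self.history.append(f"{agent}: {observation}")
        self.stall_count = 0 if made_progress else self.stall_count + 1
```

Keeping the two objects separate means a replanning step can rewrite the plan in the Task Ledger without touching the step-by-step record, and vice versa.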

Modeling

Base Model: GPT-4o

Reproducibility

Code: https://aka.ms/magentic-one

The code is publicly available and includes implementations of the agents and the AutoGenBench evaluation tool. The system uses GPT-4o; the exact prompts are included in the repo.

📊 Experiments & Results

Evaluation Setup

Evaluation on three diverse agentic benchmarks covering general assistant tasks, web navigation, and realistic web-based user-assistant tasks.

Benchmarks:

GAIA (General AI Assistants, validation set)
WebArena (Web Navigation and Task Completion)
AssistantBench (Realistic web-based user assistant tasks)

Metrics:

Task Completion Rate (Success Rate)
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
GAIA (Validation)	Success Rate	34.1	38.0	+3.9
WebArena	Success Rate	19.6	32.8	+13.2
AssistantBench	Accuracy (ACC)	13.9	27.7	+13.8

GAIA (Validation): competitive performance against SOTA.
WebArena: strong web navigation capabilities.
AssistantBench: ability to handle realistic user queries.
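
The Δ values are absolute differences in percentage points. As a quick check, the short script below recomputes them from the baseline and reported scores and also derives the relative improvement over each baseline. The relative figures are a derived quantity added here for context and are not reported in the summary.

```python
# Scores copied from the results table: (baseline, this paper), in percent.
results = {
    "GAIA (Validation)": (34.1, 38.0),
    "WebArena": (19.6, 32.8),
    "AssistantBench": (13.9, 27.7),
}


def improvement(baseline: float, ours: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, ours) in results.items():
    absolute, relative = improvement(baseline, ours)
    print(f"{name}: +{absolute} points, {relative}% relative to baseline")
```

The same absolute gain can mean very different relative gains: the WebArena and AssistantBench deltas are similar in points, but the AssistantBench baseline is lower, so its relative improvement is larger.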

Experiment Figures

An example trace of Magentic-One solving a benchmark task involving multiple steps.

Main Takeaways

Magentic-One demonstrates strong generalization across three distinct benchmarks (GAIA, WebArena, AssistantBench) without modification to core capabilities.
The multi-agent architecture outperforms single-agent GPT-4o baselines significantly (e.g., on WebArena).
The modular design allows for competitive performance even against agents specialized for specific domains (like web navigation).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting strategies (CoT, ReAct)
Familiarity with autonomous agent architectures (planning, tool use, memory)
Basic knowledge of web automation (DOM, viewports) and code execution sandboxes

Key Terms

Orchestrator: The lead agent responsible for planning, maintaining memory (ledgers), and assigning tasks to other specialized agents

Task Ledger: A structured memory object tracking the overall plan, verified facts, educated guesses, and remaining steps

Progress Ledger: A short-term memory object tracking immediate history, loop detection, and next-step instructions for the inner loop

WebSurfer: A specialized agent controlling a Chromium browser, capable of navigation, clicking, typing, and reading web pages

FileSurfer: A specialized agent for navigating local directories and previewing file contents (markdown-based)

Set-of-Marks: A prompting technique where UI elements in a screenshot are overlaid with numeric labels to allow the LLM to refer to them by ID

POMDP: Partially Observable Markov Decision Process, a mathematical framework for modeling decision-making where the agent cannot see the entire state of the world

AutoGen: The underlying multi-agent framework used to implement Magentic-One
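
The Set-of-Marks idea defined above comes down to keeping a mapping between numeric labels and UI elements, so that a model can answer with an ID instead of describing an element in words. The sketch below covers only that bookkeeping. It does not draw labels on a screenshot, and the element fields (`role`, `text`) and function names are hypothetical.

```python
from typing import Dict, List

Element = Dict[str, str]


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Give each interactive element a numeric label, starting at 1."""
    return {mark_id: el for mark_id, el in enumerate(elements, start=1)}


def legend(marks: Dict[int, Element]) -> str:
    """Render the labels as text that could accompany a marked screenshot."""
    return "\n".join(
        f"[{mark_id}] {el['role']}: {el['text']}" for mark_id, el in marks.items()
    )


def resolve(marks: Dict[int, Element], mark_id: int) -> Element:
    """Map an ID chosen by the model back to the element it refers to."""
    if mark_id not in marks:
        raise KeyError(f"no element labeled {mark_id}")
    return marks[mark_id]
```

For example, with a search button labeled 1 and a link labeled 2, a model response of "click 2" would be resolved to the link element before the browser action is carried out.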