Agent S: An Open Agentic Framework that Uses Computers Like a Human

📝 Paper Summary

Memory organization, Memory recall, Self-evolving, Agentic reasoning

Agent S automates complex computer tasks by combining hierarchical planning with external knowledge and internal memory, refining its performance through self-supervised experience accumulation and a specialized interface.

Core Problem

Automating OS-level tasks fails because agents lack domain knowledge for diverse apps, struggle with long-horizon planning, and cannot precisely ground actions on dynamic, non-uniform GUIs.

Why it matters:

Current GUI agents struggle to generalize across the vast range of constantly evolving desktop applications and websites
Long-horizon tasks require tracking progress and intermediate subgoals, which standard flat-planning agents often lose track of
Precise mouse/keyboard control is difficult for MLLMs due to a lack of internal coordinate systems and the need to process dense visual information

Concrete Example: In a long-horizon task like 'create a calendar invite based on an email', a standard agent might successfully open the calendar but fail to copy specific details or lose track of the date after switching windows, whereas Agent S retrieves past successful subtask experiences to guide the specific form-filling steps.

Key Novelty

Experience-Augmented Hierarchical Planning with Continual Memory Update

Decomposes each task into subtasks: the high-level planner forms a strategy from 'Narrative Memory' (abstract summaries of past full tasks) and 'Online Web Knowledge'
Low-level workers execute subtasks using 'Episodic Memory' (detailed step-by-step traces) to guide specific actions, updating memory with self-evaluated success/failure summaries

Architecture

The complete Agent S framework, detailing the interaction between the User, Manager, Worker, and the Environment via the ACI.

Evaluation Highlights

Achieves 20.58% success rate on OSWorld, outperforming the baseline by 9.37 percentage points (83.6% relative improvement)
Establishes new state-of-the-art on OSWorld across multiple categories including daily tasks and professional workflows
Generalizes to WindowsAgentArena with 18.2% success rate (vs 13.3% baseline) without explicit adaptation, showing cross-OS robustness
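
The absolute and relative gains quoted above follow from simple arithmetic. A minimal check, using only the success rates stated in this summary (the function name `improvement` is illustrative, not from the paper):

```python
from typing import Tuple


def improvement(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# OSWorld: the baseline is 11.21% (20.58 minus the reported 9.37-point gain).
osworld_abs, osworld_rel = improvement(11.21, 20.58)
# WindowsAgentArena: baseline 13.3%, Agent S 18.2%.
waa_abs, waa_rel = improvement(13.3, 18.2)

print(f"OSWorld: +{osworld_abs:.2f} pp, +{osworld_rel:.1f}% relative")
print(f"WindowsAgentArena: +{waa_abs:.1f} pp, +{waa_rel:.1f}% relative")
```

Percentage points measure the raw difference between two rates, while the relative figure divides that difference by the baseline, which is why a 9.37-point gain on a low baseline reads as a large relative improvement.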

Breakthrough Assessment

8/10

A significant jump in SOTA on the difficult OSWorld benchmark. The framework effectively combines hierarchical planning with a retrieval-based memory system that learns from experience, addressing key bottlenecks in long-horizon GUI automation.

⚙️ Technical Details

Problem Definition

Setting: Autonomous interaction with computer Graphical User Interfaces (GUIs) to solve natural language tasks

Inputs: User task instruction T_u, initial environment observation O_0 (screenshot + accessibility tree)

Outputs: Sequence of primitive actions (e.g., click, type) to execute the task

Pipeline Flow

Input Processing (User Task + Observation)
Manager Planning (Web Search + Narrative Memory Retrieval)
Subtask Delegation (Queue of subtasks)
Worker Execution (Episodic Memory Retrieval + Action Generation)
Environment Interaction (ACI execution)
Self-Evaluation & Memory Update
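
The flow above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the Agent S implementation: `Memory`, `run_task`, the callables `plan`, `act`, and `evaluate`, and the exact-match memory lookup are all hypothetical stand-ins for the MLLM-backed Manager, Worker, and Self-Evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    narrative: Dict[str, str] = field(default_factory=dict)  # task -> full-task summary
    episodic: Dict[str, str] = field(default_factory=dict)   # subtask -> step-level summary


def run_task(
    task: str,
    plan: Callable[[str, Dict[str, str]], List[str]],
    act: Callable[[str, str], bool],
    evaluate: Callable[[str, bool], str],
    memory: Memory,
    max_steps: int = 5,
) -> bool:
    """Manager plans, Worker executes each subtask, Self-Evaluator updates memory."""
    subtasks = plan(task, memory.narrative)        # Manager planning
    outcomes: List[bool] = []
    for subtask in subtasks:                       # subtask delegation, in queue order
        hint = memory.episodic.get(subtask, "")    # episodic retrieval (exact match here)
        done = False
        for _ in range(max_steps):                 # Worker acts until done or out of steps
            done = act(subtask, hint)
            if done:
                break
        outcomes.append(done)
        memory.episodic[subtask] = evaluate(subtask, done)  # subtask-level memory update
    success = bool(outcomes) and all(outcomes)
    memory.narrative[task] = evaluate(task, success)        # task-level memory update
    return success


# Stub components standing in for MLLM calls and the environment.
def stub_plan(task: str, narrative: Dict[str, str]) -> List[str]:
    return ["open calendar", "fill invite form"]


def stub_act(subtask: str, hint: str) -> bool:
    return True  # pretend every action completes its subtask


def stub_evaluate(name: str, ok: bool) -> str:
    return f"{name}: {'success' if ok else 'failure'}"


memory = Memory()
task = "create a calendar invite based on an email"
ok = run_task(task, stub_plan, stub_act, stub_evaluate, memory)
print(ok, memory.narrative[task])
```

Passing the planner, actor, and evaluator in as callables keeps the loop independent of any particular model or environment, so the same skeleton can be exercised with stubs as shown here.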

System Modules

Manager (Planning)

Decompose user task into subtasks using external knowledge and past narrative experience

Model or implementation: MLLM (GPT-4o)

Experience Context Fusion (Planning)

Synthesize web knowledge and narrative memory into a fused guideline

Model or implementation: Not an explicitly separate model; likely part of the Manager prompt

Worker

Execute specific subtasks by generating grounded actions

Model or implementation: MLLM (GPT-4o)

Self-Evaluator

Assess success of subtasks and full tasks; generate summaries for memory

Model or implementation: MLLM (GPT-4o)

Agent-Computer Interface (ACI)

Ground MLLM outputs to executable actions; augment observation with accessibility tree

Model or implementation: Code/API Layer + OCR
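
One way to picture the grounding role of the ACI is a lookup from a tagged interface element to screen coordinates. The sketch below is an assumption-laden illustration, not the Agent S API: the `Element` type, `tag_elements`, `ground_click`, the node dictionary layout, and the `(left, top, width, height)` bounding-box convention are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Element:
    element_id: int
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # assumed layout: (left, top, width, height) in pixels


def tag_elements(raw_nodes: List[dict]) -> Dict[int, Element]:
    """Assign sequential integer ids to accessibility-tree nodes."""
    return {
        i: Element(i, node["role"], node["name"], tuple(node["bbox"]))
        for i, node in enumerate(raw_nodes)
    }


def ground_click(elements: Dict[int, Element], element_id: int) -> Tuple[str, int, int]:
    """Translate an element-level click into a pixel-level click at the element centre."""
    left, top, width, height = elements[element_id].bbox
    return ("click", left + width // 2, top + height // 2)


# Hypothetical nodes, as they might be read from an accessibility tree.
nodes = [
    {"role": "button", "name": "Save", "bbox": [100, 200, 80, 30]},
    {"role": "entry", "name": "Title", "bbox": [100, 120, 300, 24]},
]
elements = tag_elements(nodes)
action = ground_click(elements, 0)
print(action)
```

Referring to elements by id lets the model choose what to interact with while the interface layer handles where, which sidesteps the model's lack of an internal coordinate system.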

Novel Architectural Elements

Dual-memory architecture separating high-level 'Narrative Memory' (for planning) and low-level 'Episodic Memory' (for execution)
Closed-loop self-supervised exploration phase to bootstrap memory before deployment
Integration of Online Web Knowledge search directly into the hierarchical planning step
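
The dual-memory split can be illustrated with two independent stores queried at different levels of the hierarchy. This is a minimal sketch under assumptions: `MemoryStore` is a hypothetical class, and the bag-of-words cosine similarity is a simple stand-in for whatever retrieval method the framework actually uses.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Keyed summaries retrieved by similarity to a query."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, str]] = []

    def add(self, key: str, summary: str) -> None:
        self._items.append((key, summary))

    def retrieve(self, query: str) -> Optional[str]:
        if not self._items:
            return None
        best_key, best_summary = max(self._items, key=lambda kv: cosine(query, kv[0]))
        return best_summary


narrative = MemoryStore()  # full-task summaries, consulted when planning
episodic = MemoryStore()   # subtask traces, consulted when executing

narrative.add("create calendar invite from email", "Plan: read email, open calendar, fill form, save")
narrative.add("rename files in folder", "Plan: open file manager, select files, batch rename")
episodic.add("fill calendar invite form", "Steps: click Title, type subject, set date, click Save")

plan_hint = narrative.retrieve("create a calendar invite based on an email")
step_hint = episodic.retrieve("fill invite form")
print(plan_hint)
print(step_hint)
```

Keeping the two stores separate means a planning query never pulls in step-level detail, and an execution query never pulls in a whole-task outline.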

Modeling

Base Model: GPT-4o (OpenAI) and Claude-3.5-Sonnet (Anthropic)

Training Method: In-context learning with retrieval (RAG) and memory updates; no gradient updates to the MLLM itself
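
Because the model weights stay frozen, any improvement has to arrive through the prompt. The sketch below shows that idea only; `build_manager_prompt` and its prompt wording are hypothetical and are not the prompts used in the paper.

```python
def build_manager_prompt(task: str, web_knowledge: str, narrative_experience: str) -> str:
    """Assemble the planning context; the model itself is never updated."""
    parts = [
        f"Task: {task}",
        f"Web knowledge: {web_knowledge or 'none retrieved'}",
        f"Past experience: {narrative_experience or 'none retrieved'}",
        "Produce an ordered list of subtasks.",
    ]
    return "\n".join(parts)


task = "create a calendar invite based on an email"

# Before any experience is stored, the prompt carries only the task.
before = build_manager_prompt(task, "", "")

# After memory and search return results, the same frozen model sees richer context.
after = build_manager_prompt(
    task,
    "Calendar apps expose a 'New event' form.",
    "Earlier success: read the email first, then open the calendar.",
)
print(after)
```

The only thing that differs between the two calls is the retrieved text, which is what "learning" means in a retrieval-based, gradient-free setup.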

Compute: Not reported in the paper

Comparison to Prior Work

vs. OSWorld Agent: Agent S adds hierarchical planning, external web knowledge, and persistent memory updates, whereas the OSWorld agent is primarily a direct-acting agent.
vs. WindowsAgentArena: Agent S applies the same framework to Windows without specialized adaptation, showing generalization.
vs. ReAct/Standard Agents: Agent S uses a specialized ACI with dual-input (vision + accessibility tree) and separates planning memory from execution memory.

Limitations

Reliance on proprietary MLLMs (GPT-4o) implies high cost and latency
Memory retrieval latency may increase as the memory bank grows large over time
Performance depends on the quality of the Accessibility Tree; if the OS/App tree is broken, grounding suffers
No statistical significance tests reported for the improvements

Reproducibility

Code: https://github.com/simular-ai/Agent-S

Code is publicly available at https://github.com/simular-ai/Agent-S. The paper relies on proprietary models (GPT-4o, Claude-3.5-Sonnet), which are closed-source dependencies. The prompts and the memory banks pre-computed during the exploration phase are mentioned as parts of the framework.

📊 Experiments & Results

Evaluation Setup

Evaluate on OS-level tasks requiring interaction with multiple applications (Ubuntu Linux and Windows)

Benchmarks:

OSWorld: desktop computer tasks (Ubuntu)
WindowsAgentArena: desktop computer tasks (Windows)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Main results on OSWorld showing significant improvement over the state-of-the-art baseline.
OSWorld	Success Rate	11.21	20.58	+9.37
Generalization results on WindowsAgentArena showing transfer capability.
WindowsAgentArena	Success Rate	13.3	18.2	+4.9
Ablation results on the OSWorld (Domain Knowledge) subset.
OSWorld (Domain Knowledge subset)	Success Rate	22.22	13.25	-8.97
OSWorld (Domain Knowledge subset)	Success Rate	22.22	18.15	-4.07

Experiment Figures

Success rate comparison between Agent S and the OSWorld Baseline across different task categories (Office, Daily, Professional, etc.).

Main Takeaways

Agent S establishes a new SOTA on OSWorld, nearly doubling the baseline's success rate (11.21% to 20.58%).
The framework generalizes well to Windows (WindowsAgentArena) without specific tuning, suggesting the ACI and planning modules are robust.
Ablation studies confirm that both Narrative Memory (internal experience) and Web Knowledge (external info) are crucial, with Narrative Memory having a larger impact on domain-specific tasks.
The ACI's precise grounding mechanisms (hybrid OCR + Accessibility Tree) contribute substantially to the performance gains.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with GUI automation (Accessibility Trees, coordinate grounding)
Basic knowledge of Hierarchical Reinforcement Learning concepts (Manager/Worker hierarchy)

Key Terms

Narrative Memory: High-level, abstract summaries of past full-task experiences used by the Manager for planning

Episodic Memory: Detailed, step-by-step records of subtask execution used by Workers for low-level action generation

ACI: Agent-Computer Interface—an abstraction layer that translates MLLM outputs into precise computer actions and provides grounded observations via accessibility trees

OSWorld: A benchmark environment for evaluating multimodal agents on open-ended computer tasks within a Linux operating system

WindowsAgentArena: A benchmark for evaluating agents on Windows OS tasks

RAG: Retrieval-Augmented Generation—AI systems that answer questions or plan by first searching for relevant documents/memories

Set-of-Mark Prompting: A visual prompting technique where objects in an image are overlaid with numeric tags to help the model reference them

Accessibility Tree: A hierarchical representation of a user interface's elements (buttons, text, etc.) provided by the OS for assistive technologies

IOU: Intersection over Union—a metric used to measure the overlap between two bounding boxes
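
For readers unfamiliar with IOU, the metric can be computed directly from two box corners. This is a generic sketch of the standard definition; the `(x_min, y_min, x_max, y_max)` box convention and the function name `iou` are assumptions for illustration, and how Agent S applies the metric is not detailed in this summary.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 in each axis: intersection 25, union 175.
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(overlap, 4))
```

A value of 1.0 means the boxes coincide and 0.0 means they do not overlap at all.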