Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

📝 Paper Summary

OS Agents Benchmarking Multi-modal Agents

WindowsAgentArena is a scalable, reproducible benchmark for multi-modal agents on Windows. It cuts evaluation latency by parallelizing tasks on Azure and introduces a baseline agent, Navi.

Core Problem

Existing agent benchmarks focus on Web or Linux and ignore the dominant Windows OS (73% market share). They also evaluate multi-step tasks serially, so a full run can take days.

Why it matters:

Most human productivity occurs on Windows, yet agents are primarily tested on Linux/Web, creating a domain gap.
Sequential evaluation of complex OS tasks is prohibitively slow, hindering rapid iterative development of agents.
Existing benchmarks often lack the realism of a full OS environment where agents must switch between applications and contexts.

Concrete Example: A task might require an agent to 'Make the line spacing of first two paragraphs into double line spacing' in LibreOffice Writer. Across the whole benchmark, humans succeed on 74.5% of tasks while the best agent reaches only 19.5%, often failing to locate the correct menu items or handle window focus.

Key Novelty

Scalable Windows OS Benchmark

Introduces the first extensive environment for Windows tasks (154 tasks across diverse apps) deployable in Docker containers.
Implements a cloud-native parallelization architecture using Azure to reduce full benchmark evaluation time from days to ~20 minutes.
Provides a baseline agent (Navi) utilizing Set-of-Marks prompting to interact with hybrid pixel/accessibility-tree observations.

Architecture

The Navi agent architecture and inference flow.

Evaluation Highlights

Navi (best agent variant) achieves 19.5% success rate on WindowsAgentArena, highlighting the difficulty of the domain compared to the 74.5% human baseline.
Cloud parallelization reduces full benchmark evaluation time to about 20 minutes, compared with the days required for serial execution.
Human performance is highest on Windows Utilities (91.7%) and lowest on VLC Player tasks (42.8%), giving a per-domain reference point for agent improvement.

Breakthrough Assessment

8/10

Addresses a massive gap (Windows OS) and a critical bottleneck (eval time). While agent performance is low, the infrastructure enabling scalable Windows research is a significant contribution.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) within a Windows 11 environment

Inputs: Task instruction, clipboard content, session metadata, and screen representation (screenshot, Accessibility Tree, or Set-of-Marks)

Outputs: Executable actions (mouse click, keyboard type, or API calls via a Computer class)
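The inputs and outputs listed here can be pictured as plain data containers. The following is a minimal sketch, not the benchmark's actual API: the class names `Observation` and `Action` and every field name are hypothetical, chosen only to mirror the items in the problem definition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """What the agent sees at one step (hypothetical field names)."""
    instruction: str                               # natural-language task instruction
    clipboard: str = ""                            # current clipboard content
    metadata: dict = field(default_factory=dict)   # session metadata
    screenshot_png: Optional[bytes] = None         # raw screen capture, if provided
    a11y_tree: Optional[str] = None                # serialized accessibility tree, if provided


@dataclass
class Action:
    """One executable step emitted by the agent (hypothetical schema)."""
    kind: str                    # "click", "type", or "api"
    x: Optional[int] = None      # screen coordinates for clicks
    y: Optional[int] = None
    text: Optional[str] = None   # text to type, or an API call expression


obs = Observation(instruction="Set double line spacing for the first two paragraphs")
act = Action(kind="click", x=120, y=48)
```

Because the environment is partially observable, the agent only ever receives an `Observation` like this, never the full state of the Windows VM.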

Pipeline Flow

Environment Setup (Docker/Azure)
Observation Processing (Set-of-Marks/OmniParser)
Agent Reasoning (Navi/GPT-4V)
Action Execution (PyAutoGUI/Computer Class)
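The four stages above form a loop that repeats until the task ends. The sketch below shows one way to express that loop, assuming each stage is passed in as a plain function. The function name `run_episode`, its parameters, and the `max_steps` default are illustrative assumptions, and the toy stand-ins replace the real VM and model.

```python
from typing import Callable, List, Tuple


def run_episode(
    reset: Callable[[], str],
    step: Callable[[str], Tuple[str, bool]],
    annotate: Callable[[str], str],
    policy: Callable[[str], str],
    max_steps: int = 15,  # assumed step budget, not a value from the paper
) -> List[str]:
    """Run one task: observe, annotate, decide, act, until done or out of steps."""
    actions: List[str] = []
    obs = reset()                      # 1. environment setup
    for _ in range(max_steps):
        annotated = annotate(obs)      # 2. observation processing
        action = policy(annotated)     # 3. agent reasoning
        actions.append(action)
        obs, done = step(action)       # 4. action execution
        if done:
            break
    return actions


# Toy stand-ins so the loop runs without a VM or a model.
state = {"clicks": 0}


def toy_reset() -> str:
    state["clicks"] = 0
    return "screen:0"


def toy_step(action: str) -> Tuple[str, bool]:
    state["clicks"] += 1
    return f"screen:{state['clicks']}", state["clicks"] >= 3


trace = run_episode(
    toy_reset,
    toy_step,
    annotate=lambda o: f"[marked] {o}",
    policy=lambda o: f"click on {o}",
)
```

Keeping the stages as injected functions makes it easy to swap the observation processor or the backbone model without touching the loop.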

System Modules

Environment Wrapper

Manages Windows 11 VM state, task setup, and observation capture via Flask server

Model or implementation: Custom Python/Flask bridge to QEMU/KVM

Observation Processor

Augments raw screenshots with Set-of-Marks to ground interactive elements

Model or implementation: OmniParser (or similar pixel-based detector) + Accessibility Tree
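To make the Set-of-Marks idea concrete, here is a minimal sketch of the bookkeeping involved: detected element boxes get numeric IDs, and an ID chosen by the model is translated back into click coordinates. The function names and the reading-order sort are assumptions for illustration; the actual annotation (drawing labels on the screenshot) is omitted.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Give each box a numeric mark ID, ordered top-to-bottom then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def mark_center(marks: Dict[int, Box], mark_id: int) -> Tuple[int, int]:
    """Translate a mark ID chosen by the model into click coordinates."""
    left, top, right, bottom = marks[mark_id]
    return ((left + right) // 2, (top + bottom) // 2)


marks = assign_marks([(200, 10, 260, 40), (10, 10, 90, 40), (10, 100, 90, 130)])
click_xy = mark_center(marks, 2)  # center of the second element in reading order
```

With this mapping the model only has to name a mark ID, and the system turns that ID into pixel coordinates.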

Navi Agent

Plans next step and selects action based on annotated observation

Model or implementation: GPT-4V or GPT-4o (via API)

Novel Architectural Elements

Cloud-native parallelization scheme where workers map 1:1 to tasks via Azure Compute Instances
Hybrid observation space combining pixel detectors (OmniParser) and system accessibility trees for Set-of-Marks generation on Windows
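One plausible way to combine two sources of element boxes is to drop near-duplicates by intersection-over-union (IoU). The sketch below is an assumption about how such a merge could work, not the paper's documented procedure: the preference for accessibility-tree boxes and the 0.5 threshold are illustrative choices.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(
    tree_boxes: List[Box], pixel_boxes: List[Box], threshold: float = 0.5
) -> List[Box]:
    """Keep all accessibility-tree boxes; add pixel boxes that do not duplicate a kept box."""
    merged = list(tree_boxes)
    for candidate in pixel_boxes:
        if all(iou(candidate, kept) < threshold for kept in merged):
            merged.append(candidate)
    return merged


tree = [(0, 0, 100, 50)]
pixels = [(2, 1, 100, 50), (300, 300, 340, 340)]
merged = merge_boxes(tree, pixels)  # first pixel box overlaps the tree box and is dropped
```

The merged list can then be numbered as Set-of-Marks, so elements missing from the accessibility tree still receive a mark.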

Modeling

Base Model: GPT-4V / GPT-4o (Agent Backbone)

Training Method: Prompt Engineering / In-Context Learning only

Adaptation: None (Inference-only)

Trainable Parameters: 0 (Frozen API model)

Compute: A full benchmark run takes ~20 minutes when parallelized on Azure. With one worker per task this implies roughly 154 parallel nodes, so burst compute usage is high.
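The effect of parallelization on wall-clock time can be sketched with a local worker pool standing in for cloud workers. Nothing below calls Azure: `run_parallel` and `estimated_wall_clock` are hypothetical helpers, the 10-minute per-task duration is a made-up figure for illustration, and the idealized estimate ignores VM startup and scheduling overhead.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def estimated_wall_clock(num_tasks: int, num_workers: int, minutes_per_task: float) -> float:
    """Idealized wall-clock time: tasks run in waves of `num_workers`."""
    waves = math.ceil(num_tasks / num_workers)
    return waves * minutes_per_task


def run_parallel(
    task_ids: List[str], run_task: Callable[[str], bool], num_workers: int
) -> Dict[str, bool]:
    """Run every task on a pool of workers and collect pass/fail per task."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        outcomes = list(pool.map(run_task, task_ids))
    return dict(zip(task_ids, outcomes))


task_ids = [f"task-{i:03d}" for i in range(154)]
# Dummy task runner: "passes" any task whose ID ends in 0.
results = run_parallel(task_ids, lambda tid: tid.endswith("0"), num_workers=8)

serial = estimated_wall_clock(154, 1, minutes_per_task=10.0)
parallel = estimated_wall_clock(154, 154, minutes_per_task=10.0)
```

Under these assumptions the serial estimate is 154 task-durations while the fully parallel estimate is a single task-duration, which is why per-task cost dominates once every task has its own worker.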

Comparison to Prior Work

vs. OSWorld: Targets Windows (73% market share) vs Linux; uses Azure cloud parallelization vs local VMware parallelization
vs. Mind2Web: Interactive execution environment vs static dataset evaluation
vs. UFO: Evaluation on a standardized reproducible benchmark suite vs ad-hoc testing
vs. Cradle: General computer control agent, but typically evaluated on games or single apps rather than a diverse OS suite [not cited in paper]

Limitations

Cannot provide pre-built Windows VM images due to licensing, requiring user setup effort.
Agent performance (19.5%) is significantly below human level (74.5%), indicating high task difficulty.
Reliance on proprietary models (GPT-4V/4o) for the reference agent limits accessibility for some researchers.
Evaluation cost can be high if running all tasks via paid API calls.

Reproducibility

Code: https://github.com/microsoft/WindowsAgentArena

📊 Experiments & Results

Evaluation Setup

154 diverse Windows tasks evaluated via execution-based verification (checking file states, settings, etc.)
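Execution-based verification means inspecting the resulting system state rather than the agent's action trace. The sketch below illustrates the idea with a hypothetical checker that reads a JSON settings file; the file name, the key, and the function `check_setting` are invented for the example and do not come from the benchmark.

```python
import json
import tempfile
from pathlib import Path


def check_setting(config_path: Path, key: str, expected) -> bool:
    """Return True if the file exists and stores the expected value under `key`."""
    if not config_path.exists():
        return False
    try:
        data = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get(key) == expected


with tempfile.TemporaryDirectory() as tmp:
    settings = Path(tmp) / "settings.json"
    # Before the (simulated) agent acts, the file does not exist.
    before = check_setting(settings, "line_spacing", 2.0)
    # Simulate the effect of a successful episode.
    settings.write_text(json.dumps({"line_spacing": 2.0}), encoding="utf-8")
    after = check_setting(settings, "line_spacing", 2.0)
```

A checker of this kind accepts any action sequence that reaches the target state, so agents are not penalized for taking a different route than a reference solution.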

Benchmarks:

WindowsAgentArena (OS Control, Windows) [New]
Mind2Web (Web Navigation)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper
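Success rate is the fraction of tasks whose final check passes. As a minimal sketch, assuming each task yields a binary outcome tagged with its application domain, the overall and per-domain rates could be computed as follows. The function name and the toy data are illustrative and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(outcomes: List[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Return (overall SR, per-domain SR) in percent from (domain, passed) pairs."""
    if not outcomes:
        raise ValueError("no outcomes to aggregate")
    per_domain: Dict[str, List[bool]] = defaultdict(list)
    for domain, passed in outcomes:
        per_domain[domain].append(passed)
    overall = 100.0 * sum(passed for _, passed in outcomes) / len(outcomes)
    by_domain = {d: 100.0 * sum(v) / len(v) for d, v in per_domain.items()}
    return overall, by_domain


# Made-up outcomes for illustration only.
toy = [("Chrome", True), ("Chrome", False), ("VLC", False), ("VLC", False)]
overall, by_domain = success_rates(toy)
```

Note that the overall rate weights every task equally, so domains with more tasks influence it more than a simple average of the per-domain rates would.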

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WindowsAgentArena	Success Rate (%): human vs. Navi	74.5	19.5	-55.0
WindowsAgentArena	Evaluation Time (minutes): serial vs. parallel	1440	20	-1420
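The Δ column is a simple difference, but ratios are often easier to interpret. The short calculation below recomputes the deltas from the four numbers in the table and adds two derived quantities, the speedup factor and the agent's success rate relative to humans. The variable names are illustrative.

```python
human_sr, navi_sr = 74.5, 19.5        # success rates in percent
serial_min, parallel_min = 1440, 20   # evaluation time in minutes

sr_gap = round(navi_sr - human_sr, 1)        # difference in percentage points
time_saved = parallel_min - serial_min       # difference in minutes
speedup = serial_min / parallel_min          # how many times faster
relative_sr = round(navi_sr / human_sr, 3)   # agent SR as a fraction of human SR
```

Read this way, the parallel setup is 72 times faster than the serial figure in the table, while the agent reaches roughly a quarter of human success.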

Experiment Figures

Success rate breakdown by application domain (e.g., File Explorer, Chrome, LibreOffice).

Main Takeaways

Windows tasks are highly challenging for current VLM agents, with a large gap between agent (19.5%) and human (74.5%) performance.
Task parallelization in the cloud is essential for practical OS agent benchmarking, reducing feedback loops from days to minutes.
Set-of-Marks prompting combined with accessibility trees is the most effective observation representation for the Navi agent.

📚 Prerequisite Knowledge

Prerequisites

Understanding of POMDPs in agentic control
Familiarity with Vision-Language Models (VLMs)
Basic knowledge of OS accessibility layers (UI Automation)

Key Terms

POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment

Set-of-Marks: A prompting technique where interactive elements on a screen are overlaid with numeric tags (marks), allowing a vision model to reference specific UI elements by ID

UI Automation tree: A hierarchical representation of the user interface elements provided by the OS for accessibility tools (screen readers), used here to ground agent observations

DOM: Document Object Model—a tree structure representing the content of a web page

VLM: Vision-Language Model—an AI model capable of understanding and generating content based on both image and text inputs (e.g., GPT-4V)

RGB array: A grid of pixels representing the screen's visual output (Red, Green, Blue channels)

Azure Machine Learning: A cloud service for managing ML lifecycles, used here to orchestrate parallel agent evaluations