
Towards Enterprise-Ready Computer Using Generalist Agent

Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Aviad Sela, Asaf Adi, Nir Mashkif
IBM Research
arXiv.org (2025)
Agent MM Benchmark Reasoning

📝 Paper Summary

Web agents, Multi-task planning
CUGA (Computer Using Generalist Agent) achieves state-of-the-art performance on WebArena and AppWorld by evolving from a simple agent loop into a specialized multi-agent architecture that separates high-level planning from web and API execution.
Core Problem
Generalist agents often fail at complex, long-horizon enterprise tasks because simple plan-act-observe loops struggle with context maintenance, variable propagation, and precise UI/API interaction.
Why it matters:
  • Simple architectures achieve only ~15% success on WebArena, which is insufficient for real-world enterprise adoption, where high reliability is required
  • Enterprise workflows require handling privacy, safety, and complex multi-step processes across diverse applications, which single-loop agents cannot manage effectively
  • Existing benchmarks like AppWorld require dynamic API selection and reasoning about preconditions, capabilities often missing in standard web agents
Concrete Example: In an initial version, the planner identified the correct action (select from dropdown), but execution failed because the UI implementation was non-standard. Similarly, API agents failed to shortlist relevant APIs from verbose OpenAPI specs.
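The API shortlisting failure can be made concrete with a small sketch. The code below is not the paper's implementation: the function names `minimize_openapi` and `shortlist`, the toy spec, and the keyword-overlap ranking are all assumptions chosen for illustration. It only shows why a compact, one-entry-per-operation view of an OpenAPI document is easier to search than the full spec.

```python
def minimize_openapi(spec):
    """Reduce an OpenAPI-style dict to one compact entry per operation.

    Hypothetical helper: keeps only the name, call signature, summary, and
    parameter names, and drops schemas and responses. Path-level keys that
    are not HTTP methods are not handled in this sketch.
    """
    entries = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            entries.append({
                "name": op.get("operationId", f"{method} {path}"),
                "call": f"{method.upper()} {path}",
                "summary": op.get("summary", ""),
                "params": [p["name"] for p in op.get("parameters", [])],
            })
    return entries


def shortlist(entries, query, top_k=3):
    """Rank entries by naive keyword overlap with the query.

    This scoring is an illustrative stand-in, not the ranking used by CUGA.
    """
    words = set(query.lower().split())

    def score(entry):
        text = f"{entry['name']} {entry['summary']}".lower()
        return sum(word in text for word in words)

    ranked = sorted(entries, key=score, reverse=True)
    return [e for e in ranked if score(e) > 0][:top_k]


# Toy spec with made-up endpoints, used only to exercise the helpers.
spec = {
    "paths": {
        "/playlists": {
            "get": {
                "operationId": "list_playlists",
                "summary": "List the user's playlists",
                "parameters": [{"name": "user_id", "in": "query"}],
                "responses": {"200": {"description": "OK"}},
            }
        },
        "/payments": {
            "post": {
                "operationId": "send_payment",
                "summary": "Send a payment to a user",
                "parameters": [{"name": "amount", "in": "query"}],
                "responses": {"201": {"description": "Created"}},
            }
        },
    }
}

entries = minimize_openapi(spec)
candidates = shortlist(entries, "list playlists")
```

In practice the ranking step would more plausibly be done by an LLM or an embedding search over the minimized entries; the point of the sketch is only the reduction step that precedes it.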
Key Novelty
Iterative Multi-Agent Architecture Evolution
  • Decomposes the single agent into a 'Plan Controller' for high-level strategy and specialized 'Sub-task Plan-Execute Agents' for specific Web/API modalities
  • Introduces an 'API Registry' with minimized OpenAPI representations to enable scalable API shortlisting and execution
  • Implements a 'Smart Sampling' methodology that evaluates on small, representative subsets first, enabling rapid failure analysis and architectural refinement before scaling up
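The summary does not say how the representative subsets are chosen. One plausible reading is stratified sampling: take a few tasks from each category so that every kind of failure has a chance to appear in a cheap run. The sketch below assumes this reading; `smart_sample`, the `site` field, and the task list are hypothetical.

```python
import random
from collections import defaultdict


def smart_sample(tasks, key, per_group=2, seed=0):
    """Pick up to `per_group` tasks from each category (stratified sampling).

    Assumption: 'representative' means every category is covered. The paper's
    actual selection criteria may differ.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)

    subset = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        subset.extend(rng.sample(members, min(per_group, len(members))))
    return subset


# Made-up benchmark: 30 tasks spread evenly over three sites.
tasks = [
    {"id": i, "site": site}
    for i, site in enumerate(["shopping", "gitlab", "reddit"] * 10)
]
subset = smart_sample(tasks, key=lambda t: t["site"], per_group=2)
```

Running the agent on `subset` first and inspecting failures per category gives a fast signal about which architectural change to try next, before paying for a full benchmark run.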
Architecture
Figure 5
High-level representation of the final CUGA architecture, showing the orchestration between Plan Controller and Sub-task Agents.
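The orchestration pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CUGA's code: `PlanController`, `SubTask`, the two stub agents, and the shared `variables` dictionary are hypothetical names. The sketch shows a controller that routes each sub-task to a modality-specific agent and propagates results as named variables, which is the context-maintenance problem the summary attributes to single-loop agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubTask:
    modality: str     # "web" or "api"
    instruction: str  # natural-language description of the sub-task
    output_var: str   # variable name under which the result is stored


@dataclass
class PlanController:
    # Maps a modality name to a sub-task agent (hypothetical interface).
    agents: Dict[str, Callable[[str, Dict[str, str]], str]]
    variables: Dict[str, str] = field(default_factory=dict)

    def run(self, plan: List[SubTask]) -> Dict[str, str]:
        """Execute sub-tasks in order, passing earlier results forward."""
        for step in plan:
            agent = self.agents[step.modality]
            result = agent(step.instruction, self.variables)
            self.variables[step.output_var] = result
        return self.variables


# Stub agents standing in for the real Web and API sub-task agents.
def api_agent(instruction: str, variables: Dict[str, str]) -> str:
    return f"api:{instruction}"


def web_agent(instruction: str, variables: Dict[str, str]) -> str:
    return f"web:{instruction} using {variables.get('order_id', '?')}"


controller = PlanController(agents={"api": api_agent, "web": web_agent})
result = controller.run([
    SubTask("api", "lookup latest order", "order_id"),
    SubTask("web", "open order page", "page"),
])
```

The second sub-task reads `order_id`, which the first one produced. A real controller would also replan on failure; that is omitted here.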
Evaluation Highlights
  • 61.7% task completion on WebArena benchmark, setting a new state-of-the-art (SOTA)
  • 46% scenario completion rate on AppWorld benchmark, also achieving SOTA performance
  • The initial simple architecture achieved only 15% on WebArena and 5% on AppWorld; architectural evolution raised these to 61.7% and 46%, gains of roughly 47 and 41 percentage points
Breakthrough Assessment
8/10
Achieves SOTA on two major agentic benchmarks (WebArena and AppWorld) through a clearly documented architectural evolution. While the core components (planning, tool use) are known, the specific integration and iterative methodology for enterprise readiness are significant.