
OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji
University of Illinois Urbana-Champaign, Stevens Institute of Technology
arXiv (2026)
Agent, MM, Benchmark

📝 Paper Summary

GUI Agents, Autonomous Exploration, Agentic Workflow
OSExpert allows agents to autonomously explore software to build a verified skill set, enabling fast, single-pass planning and precise execution without relying on inefficient trial-and-error at inference time.
Core Problem
Current computer-use agents suffer from low success rates on long-horizon tasks, struggle with unseen UIs, and are 5–50× slower than humans due to reliance on step-by-step planning and inefficient test-time scaling.
Why it matters:
  • Existing agents fail to acquire environment-specific procedural knowledge, leading to cascading errors in complex workflows.
  • Reliance on blind trial-and-error exploration during inference creates unacceptable latency for real-world applications.
  • General-purpose agents lack the fine-grained control needed for professional software (e.g., precise image editing or data visualization).
Concrete Example: In GIMP (image editing), a standard agent asked to 'select an object' might repeatedly click in the wrong place or hallucinate tool locations, and it loses further time re-planning after every step. In contrast, OSExpert uses a pre-learned 'scissor select' skill that calls a segmentation primitive to trace the object boundary in a single pass.
Key Novelty
GUI-DFS Environment Learning
  • Instead of learning from human demonstrations, the agent autonomously explores the software using a Depth-First Search (DFS) strategy to discover unit functions and save them as verified skills.
  • Constructs a 'Skill Set' that maps high-level goals to verified action sequences, allowing the agent to recognize its own capabilities and limitations.
  • Replaces step-by-step reasoning with a 'Fast Planner' that generates complete plans in one pass using the learned procedural knowledge.
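The exploration idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the nested-dict UI model, the `click:` action strings, and the names `gui_dfs` and `toy_verify` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy UI: a dict value is a submenu to descend into,
# a string value names the unit function that the element triggers.
toy_ui: Dict[str, object] = {
    "Tools": {
        "Selection": {"Scissors": "scissor select"},
        "Paint": {"Bucket": "bucket fill"},
    },
    "File": {"Export": "export image"},
}


def gui_dfs(
    ui: Dict[str, object],
    verify: Callable[[List[str], str], bool],
    path: Tuple[str, ...] = (),
    skills: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Depth-first walk over the UI tree that records verified skills."""
    if skills is None:
        skills = {}
    for element, child in ui.items():
        actions = path + (f"click:{element}",)
        if isinstance(child, dict):
            # Descend fully into this submenu before visiting its siblings.
            gui_dfs(child, verify, actions, skills)
        elif verify(list(actions), str(child)):
            # Leaf: keep the action sequence only if verification passes.
            skills[str(child)] = list(actions)
    return skills


def toy_verify(actions: List[str], goal: str) -> bool:
    """Stand-in verifier (assumption): pretend 'bucket fill' fails its check."""
    return goal != "bucket fill"


skills = gui_dfs(toy_ui, toy_verify)
for goal, steps in skills.items():
    print(goal, "->", steps)
```

Keeping only verified sequences is what lets the resulting skill set double as a record of what the agent can and cannot do: a goal that is absent from `skills` is one the agent has no confirmed procedure for.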
Architecture
Figure 2
The dual-phase framework of OSExpert: (1) Bottom-up Self-Exploration using GUI-DFS to build a skill set, and (2) Efficient Inference using a Fast Planner and Skill Check.
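The inference phase might look something like the sketch below. It assumes a skill set has already been built during exploration and that the task is already split into sub-goals; the `Plan` class, the `fast_plan` function, and the exact semantics of the skill check are hypothetical, since the summary does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical skill set produced by an earlier exploration phase.
skill_set: Dict[str, List[str]] = {
    "scissor select": ["click:Tools", "click:Selection", "click:Scissors"],
    "export image": ["click:File", "click:Export"],
}


@dataclass
class Plan:
    actions: List[str] = field(default_factory=list)
    missing: List[str] = field(default_factory=list)

    @property
    def executable(self) -> bool:
        # Skill check (assumed semantics): every sub-goal must map to a skill.
        return not self.missing


def fast_plan(subgoals: List[str], skills: Dict[str, List[str]]) -> Plan:
    """Build the whole plan in one pass by looking up each sub-goal."""
    plan = Plan()
    for goal in subgoals:
        steps = skills.get(goal)
        if steps is None:
            plan.missing.append(goal)  # no verified skill for this sub-goal
        else:
            plan.actions.extend(steps)
    return plan


ok_plan = fast_plan(["scissor select", "export image"], skill_set)
bad_plan = fast_plan(["scissor select", "perspective warp"], skill_set)
print(ok_plan.executable, ok_plan.actions)
print(bad_plan.executable, bad_plan.missing)
```

The point of the design is that no re-planning happens between steps: the plan is either complete up front or flagged as outside the agent's known capabilities before any action is taken.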
Evaluation Highlights
  • Achieves a ~30% success rate on long-horizon tasks in OSExpert-Eval, tripling the performance of existing agents, which peak at ~10%.
  • Closes ~80% of the efficiency gap to human experts, measured against the most efficient existing agent baselines.
  • Demonstrates reliable transfer to unseen UIs (e.g., Tableau, MiniWord), where baselines typically score between 0% and 10%.
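One plausible reading of the efficiency-gap figure is the fraction of the baseline-to-human time difference that the agent eliminates. The summary does not state the exact formula, so the definition below is an assumption, and the timings are invented purely to show the arithmetic; they are not results from the paper.

```python
def gap_closed(human_s: float, baseline_s: float, agent_s: float) -> float:
    """Fraction of the baseline-to-human time gap removed by the agent.

    Assumed definition: (baseline - agent) / (baseline - human).
    """
    gap = baseline_s - human_s
    if gap <= 0:
        raise ValueError("baseline must be slower than the human reference")
    return (baseline_s - agent_s) / gap


# Illustrative numbers only: human 60 s, baseline agent 600 s, new agent 168 s.
fraction = gap_closed(human_s=60, baseline_s=600, agent_s=168)
print(f"{fraction:.0%} of the gap closed")
```

Under this reading, closing 80% of the gap still leaves the agent slower than the human reference; it only means most of the baseline's extra time has been removed.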
Breakthrough Assessment
8/10
A significant shift from test-time scaling to pre-inference environment exploration. The claimed ~80% reduction in the efficiency gap to human experts and the 3× improvement in success rate on complex tasks suggest a major practical advance over current step-by-step agents.