OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

📝 Paper Summary

GUI Agents, Autonomous Exploration, Agentic Workflow

OSExpert allows agents to autonomously explore software to build a verified skill set, enabling fast, single-pass planning and precise execution without relying on inefficient trial-and-error at inference time.

Core Problem

Current computer-use agents suffer from low success rates on long-horizon tasks, struggle with unseen UIs, and are 5–50× slower than humans due to reliance on step-by-step planning and inefficient test-time scaling.

Why it matters:

Existing agents fail to acquire environment-specific procedural knowledge, leading to cascading errors in complex workflows.
Reliance on blind trial-and-error exploration during inference creates unacceptable latency for real-world applications.
General-purpose agents lack the fine-grained control needed for professional software (e.g., precise image editing or data visualization).

Concrete Example: In GIMP (image editing), a standard agent trying to 'select an object' might repeatedly click in the wrong place or hallucinate tool locations, and it wastes time re-planning after every step. In contrast, OSExpert uses a pre-learned 'scissor select' skill that automatically calls a segmentation primitive to trace the object boundary in a single step.

Key Novelty

GUI-DFS Environment Learning

Instead of learning from human demos, the agent autonomously explores the software using a Depth-First Search (DFS) strategy to discover unit functions and save them as verified skills.
Constructs a 'Skill Set' that maps high-level goals to verified action sequences, allowing the agent to recognize its own capabilities and limitations.
Replaces step-by-step reasoning with a 'Fast Planner' that generates complete plans in one pass using the learned procedural knowledge.
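The exploration idea above can be sketched as a depth-first traversal over a toy UI tree. This is a minimal illustration, not the paper's implementation: the `UINode` structure, the `verify` callback, and the skill-naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UINode:
    """Hypothetical UI element: a menu (has children) or a unit function (leaf)."""
    name: str
    children: List["UINode"] = field(default_factory=list)


def gui_dfs(root: UINode, verify: Callable[[List[str]], bool]) -> Dict[str, dict]:
    """Depth-first exploration that records each leaf as a verified or failed skill."""
    skill_set: Dict[str, dict] = {}

    def visit(node: UINode, path: List[str]) -> None:
        path = path + [node.name]
        if not node.children:
            # Leaf = unit function: store the click path and the verification outcome.
            status = "verified" if verify(path) else "failure"
            skill_set[" > ".join(path)] = {"actions": path, "status": status}
            return
        for child in node.children:
            # Go deep first; returning from this call is the backtracking step.
            visit(child, path)

    visit(root, [])
    return skill_set


# Toy menu tree; the names are illustrative only.
menu = UINode("Menu", [
    UINode("Tools", [UINode("Scissors Select"), UINode("Crop")]),
    UINode("Filters", [UINode("Blur")]),
])

# Stand-in verifier: pretend that "Blur" could not be verified.
skills = gui_dfs(menu, verify=lambda path: path[-1] != "Blur")
for goal, entry in skills.items():
    print(goal, "->", entry["status"])
```

Keeping failed leaves in the same dictionary, rather than discarding them, is what would let a later feasibility check recognize tasks the agent cannot do.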

Architecture

The dual-phase framework of OSExpert: (1) Bottom-up Self-Exploration using GUI-DFS to build a skill set, and (2) Efficient Inference using a Fast Planner and Skill Check.

Evaluation Highlights

Achieves a ~30% success rate on long-horizon tasks in OSExpert-Eval, roughly triple that of existing agents, which peak at ~10%.
Closes the efficiency gap to human experts by ~80% compared to the most efficient existing agent baselines.
Demonstrates reliable transfer to unseen UIs (e.g., Tableau, MiniWord), where baselines typically score between 0% and 10%.

Breakthrough Assessment

8/10

Significant shift from test-time scaling to pre-inference environment exploration. The claimed ~80% efficiency gain and 3x success rate improvement on complex tasks suggest a major practical advancement over current step-by-step agents.

⚙️ Technical Details

Problem Definition

Setting: General-purpose computer use where an agent interacts with diverse Digital Environments (E) to complete natural language tasks.

Inputs: Natural language instruction, current GUI screenshot/state

Outputs: Sequence of keyboard/mouse actions to achieve the goal

Pipeline Flow

User Query -> Fast Planner -> Skill Boundary Check -> Action Execution (with Primitives) -> (Fallback: Test-time Scaling)
Note: The Learning Phase (GUI-DFS) happens prior to this inference pipeline.
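The inference flow above can be sketched as a small orchestration function. This is a hedged illustration under stated assumptions: `run_task`, the string status values, and the stub planner and executor are hypothetical names, and the real system uses learned models where this sketch uses plain callables.

```python
from typing import Callable, Dict, List, Optional

SkillSet = Dict[str, str]  # skill name -> "verified" or "failure" (assumed encoding)


def run_task(
    query: str,
    fast_planner: Callable[[str], List[str]],
    skill_set: SkillSet,
    execute: Callable[[str], bool],
    fallback: Optional[Callable[[str], bool]] = None,
) -> str:
    """Plan once, check feasibility, execute, and fall back only if execution fails."""
    plan = fast_planner(query)  # single pass: the whole plan is produced at once

    # Skill boundary check: stop early if any required skill failed during exploration.
    if any(skill_set.get(step) == "failure" for step in plan):
        return "infeasible"

    for step in plan:
        if not execute(step):
            # Optional fallback, standing in for test-time scaling.
            if fallback is not None and fallback(query):
                return "success_via_fallback"
            return "failed"
    return "success"


def toy_planner(query: str) -> List[str]:
    """Stub planner that maps a query to a fixed list of skill names."""
    if "export" in query:
        return ["open_image", "export_psd"]
    return ["open_image", "scissors_select"]


skills = {"open_image": "verified", "scissors_select": "verified", "export_psd": "failure"}
print(run_task("select the object", toy_planner, skills, execute=lambda step: True))
print(run_task("export as PSD", toy_planner, skills, execute=lambda step: True))
```

The first call prints `success` and the second prints `infeasible`, because the second plan needs a skill that is marked as a failure and is rejected before any action runs.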

System Modules

Fast Planner (Planning)

Generates a complete execution plan in a single forward pass using learned procedural knowledge

Model or implementation: Qwen-3-4B (LoRA fine-tuned on learned skills)

Skill Boundary Check (Planning)

Predicts if the task is feasible based on exploration history; terminates early if the required skill is marked as a 'failure' in the skill set

Model or implementation: Lookup/Heuristic based on Skill Set

Action Module

Executes the plan, using specific primitives for fine-grained control when necessary

Model or implementation: Qwen-3-VL-8B

Novel Architectural Elements

Separation of 'Exploration' (learning phase) and 'Fast Planning' (inference phase) to front-load computational cost.
Integration of a 'Skill Boundary Check' that explicitly uses negative exploration results (failed attempts) to prevent futile test-time scaling.

Modeling

Base Model: Qwen-3-VL-8B (Action Agent) and Qwen-3-4B (Fast Planner)

Training Method: Exploration-driven curriculum learning + LoRA Fine-tuning

Adaptation: LoRA (Low-Rank Adaptation) applied to the Fast Planner

Training Data:

Data is self-generated via GUI-DFS exploration.
Agent explores unit functions, verifies them, and proposes composite tasks to build a curriculum.
Successful trajectories become training data; failed trajectories inform the boundary check.
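One way to picture the split described above is a function that routes exploration trajectories into fine-tuning examples and a set of known failures. This is a sketch only: the `Trajectory` record and the prompt/completion format are assumptions, since the summary does not specify how the data is serialized.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Trajectory:
    """Hypothetical record of one exploration attempt."""
    task: str
    actions: List[str]
    success: bool


def split_trajectories(
    trajectories: List[Trajectory],
) -> Tuple[List[Dict[str, str]], Set[str]]:
    """Successes become planner training examples; failures feed the boundary check."""
    training_examples = [
        {"prompt": t.task, "completion": "\n".join(t.actions)}
        for t in trajectories
        if t.success
    ]
    failed_tasks = {t.task for t in trajectories if not t.success}
    return training_examples, failed_tasks


explored = [
    Trajectory("crop image", ["click Tools", "click Crop", "drag region"], True),
    Trajectory("export as PSD", ["click File", "click Export"], False),
]
train, failures = split_trajectories(explored)
print(len(train), "training example(s);", "known failures:", sorted(failures))
```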

Key Hyperparameters:

max_retries: 4 (during exploration)
temperature: 1.0 (base model)
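As an illustration of how a retry budget like `max_retries` might bound exploration, the sketch below retries a single attempt up to a fixed number of times. It assumes the value 4 counts total attempts; the summary does not say whether it instead means four retries after a first try, and `verify_with_retries` is a hypothetical helper.

```python
from typing import Callable, Tuple

MAX_RETRIES = 4  # from the hyperparameter list; treated here as total attempts (assumption)


def verify_with_retries(
    attempt: Callable[[], bool], max_retries: int = MAX_RETRIES
) -> Tuple[bool, int]:
    """Call `attempt` until it succeeds or the budget is spent; return (ok, tries used)."""
    for tries in range(1, max_retries + 1):
        if attempt():
            return True, tries
    return False, max_retries


# Scripted outcomes standing in for a flaky GUI interaction.
outcomes = iter([False, False, True])
result = verify_with_retries(lambda: next(outcomes))
print(result)  # (True, 3)
```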

Compute: Inference uses Qwen-3-VL-8B and Qwen-3-4B. Exploration uses GPT-5 (Planner) and UI-TARS-1.5-7B (Action) or Qwen-3-VL-8B. Exact GPU hours not reported.

Comparison to Prior Work

vs. Agent-S: OSExpert learns skills before inference via exploration, whereas Agent-S relies heavily on expensive inference-time search.
vs. Standard Baselines: OSExpert generates full plans in one pass (Fast Planner) rather than step-by-step planning, reducing latency.
vs. Cradle [not cited in paper]: Cradle uses a general vision backbone for all games/software; OSExpert explicitly maps UI trees via DFS to build discrete skills.

Limitations

Exploration cost is high (requires interacting with the environment comprehensively upfront).
Relies on strong base models (GPT-5, Qwen-3) for the exploration phase to be effective.
Environment restart is required during DFS backtracking, which may be slow or complex for some applications.

Reproducibility

Code: https://github.com/Lumos-Jiateng/OSExpert

Code and data will be released at the repository linked above. The pipeline depends on GPT-5 for exploration and on Qwen-3 models for inference, so reproducing it requires access to those models.

📊 Experiments & Results

Evaluation Setup

OSExpert-Eval benchmark containing 113 tasks across 4 dimensions: Long-horizon, Generalization, Fine-grained control, Efficiency.

Benchmarks:

OSExpert-Eval [New]: GUI manipulation across GIMP, LibreOffice, Tableau, and MiniWord

Metrics:

Success Rate
Task Completion Time (Efficiency)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative performance on OSExpert-Eval, showing gains in success rate and efficiency over baselines.
Benchmark	Metric	Baseline	This Paper	Δ
OSExpert-Eval (Long-horizon tasks)	Success Rate (%)	10	30	+20
OSExpert-Eval	Efficiency (latency gap to human, baseline = 100)	100	20	-80
OSExpert-Eval (Unseen UIs)	Success Rate (%)	10	30	+20
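The Δ column and the headline ratios follow from simple arithmetic on the table values. The short sketch below reproduces them and separates the absolute change from the relative change; the `summarize` helper is introduced only for this illustration.

```python
def summarize(baseline: float, ours: float) -> dict:
    """Return the absolute difference and the ratio between two table values."""
    return {"delta": ours - baseline, "ratio": ours / baseline}


long_horizon = summarize(10, 30)   # success rate in percent
latency_gap = summarize(100, 20)   # gap to human, baseline normalized to 100

print(long_horizon["delta"])             # 20 -> a gain of 20 percentage points
print(long_horizon["ratio"])             # 3.0 -> three times the baseline rate
print(round(1 - latency_gap["ratio"], 2))  # 0.8 -> 80% of the gap is closed
```

Note that the +20 on success rate is a difference in percentage points, while the relative improvement over a baseline of 10 is a factor of three.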

Experiment Figures

A radar chart or comparison diagram contrasting OSExpert-Eval dimensions (Long-horizon, Generalization, Fine-grained, Efficiency) against OSWorld tasks.

Main Takeaways

Environment-specific exploration (GUI-DFS) allows agents to outperform general-purpose baselines by ~20 percentage points in success rate (from ~10% to ~30%) on complex, professional software tasks.
Pre-learning skills and using a fast planner reduces the need for inference-time scaling, closing the execution time gap with humans by ~80%.
Fine-grained action primitives are essential for professional tools (like GIMP/Tableau), where standard agents fail due to lack of precise spatial control.
The 'Skill Boundary Check' prevents wasted time on tasks the agent has already found infeasible, contributing to the efficiency gains.

📚 Prerequisite Knowledge

Prerequisites

Knowledge of Vision-Language Models (VLMs) for GUI understanding
Familiarity with Reinforcement Learning or Search algorithms (DFS)
Understanding of LoRA fine-tuning

Key Terms

GUI-DFS: Graphical User Interface Depth-First Search—an exploration algorithm where the agent systematically clicks through UI elements (menus, buttons) to map out available functions.

Test-time scaling: The practice of using more computation during inference (e.g., generating multiple candidate plans or retrying steps) to improve performance, often at the cost of high latency.

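LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices whose product is added to them, so only a small number of new parameters are updated.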
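To make the parameter savings concrete: for a frozen weight matrix of shape d_out × d_in, LoRA trains two factors of shapes d_out × r and r × d_in and adds their product to the frozen matrix. The sketch below only counts parameters; the dimensions and rank are illustrative and are not the actual sizes or settings used for the Fast Planner.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Compare trainable parameters: full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_out * d_in            # every entry of the weight matrix is trainable
    lora = rank * (d_in + d_out)   # factor B is d_out x rank, factor A is rank x d_in
    return full, lora


# Illustrative layer size and rank (assumptions, not values from the paper).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```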

Action Primitives: Pre-defined, parameterized low-level behaviors (e.g., 'drag from [x1,y1] to [x2,y2]') that ensure precise execution for fine-grained tasks.
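A parameterized primitive of this kind can be sketched as a function that expands two endpoints into a fixed sequence of low-level events. The event names (`mouse_down`, `mouse_move`, `mouse_up`) and the linear interpolation are assumptions for illustration, not the paper's actual action space.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Event = Tuple[str, int, int]


def drag(start: Point, end: Point, steps: int = 4) -> List[Event]:
    """Expand a drag from `start` to `end` into press, interpolated moves, and release."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x1, y1), (x2, y2) = start, end
    events: List[Event] = [("mouse_down", x1, y1)]
    for i in range(1, steps + 1):
        # Integer linear interpolation; the last step lands exactly on the end point.
        x = x1 + (x2 - x1) * i // steps
        y = y1 + (y2 - y1) * i // steps
        events.append(("mouse_move", x, y))
    events.append(("mouse_up", x2, y2))
    return events


events = drag((0, 0), (100, 40))
for event in events:
    print(event)
```

Because the planner only supplies the two endpoints, the precise intermediate motion is deterministic rather than re-predicted step by step.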

Grounding: The process of linking abstract concepts (e.g., 'the red button') to concrete coordinates or UI elements on the screen.

DFS: Depth-First Search—an algorithm for traversing tree or graph structures that explores as far as possible along each branch before backtracking.