Video-Based Reward Modeling for Computer-Use Agents

📝 Paper Summary

Computer-Use Agents (CUA) · Reward Modeling · Video Understanding

ExeVRM is a model-agnostic evaluator that judges computer-use agent success directly from execution videos, utilizing spatiotemporal pruning and synthetically augmented data to outperform proprietary models like GPT-5.2.

Core Problem

Evaluating Computer-Use Agents (CUAs) is difficult because existing benchmarks rely on brittle, unscalable hand-crafted scripts, while outcome-based checks miss intermediate errors.

Why it matters:

Manual scripting for evaluation limits the scalability and transferability of agents to new environments
Public datasets lack negative supervision (failure cases), making it hard to train discriminative reward models
High-resolution execution videos are computationally expensive to process due to massive redundancy in static UI elements

Concrete Example: A trajectory might technically reach a final state but fail due to a subtle cue like a transient error dialog or incorrect cursor focus. A script-based evaluator might crash if the UI layout changes slightly, whereas a video-based model should robustly perceive the failure.

Key Novelty

Execution Video Reward Modeling (ExeVRM) with Spatiotemporal Pruning

Treats agent evaluation as a video understanding task: inputs are user instructions and execution videos (sequences of keyframes), independent of internal agent traces (logs/code)
Uses Spatiotemporal Token Pruning (STP and TTP) to discard tokens from redundant background regions within a frame and tokens that stay unchanged across frames, preserving only decisive UI changes (e.g., cursor moves, text edits)
Generates hard negative training examples via 'Adversarial Instruction Translation', where a model creates plausible but mismatched instructions for successful trajectories

Architecture

The Spatiotemporal Token Pruning workflow: Spatial Pruning (STP) per frame followed by Temporal Pruning (TTP) across frames

Evaluation Highlights

ExeVRM 8B achieves 84.7% accuracy on video-execution assessment, surpassing Seed-2.0 Pro (+4.4 percentage points) and GPT-5.2 (+9.7 percentage points)
Achieves 87.7% recall, significantly outperforming Seed-2.0 Pro (74.7%) and GPT-5.2 (66.5%)
Maintains tractable training costs on long-horizon videos by using 720p inputs with token pruning, which improves completion judgment compared to downsampled 360p baselines
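The cost argument can be made concrete with a rough token count. The sketch below assumes one visual token per square block of pixels; the function name and the block size of 32 are hypothetical values chosen for illustration, not figures from the paper.

```python
import math


def tokens_per_frame(width: int, height: int, block: int = 32) -> int:
    """Visual tokens for one frame, assuming one token per block x block pixels.

    `block` is a hypothetical value for illustration, not taken from the paper.
    """
    return math.ceil(width / block) * math.ceil(height / block)


hi_res = tokens_per_frame(1280, 720)  # one 720p keyframe
lo_res = tokens_per_frame(640, 360)   # one 360p keyframe
print(hi_res, lo_res, round(hi_res / lo_res, 2))
```

Under this assumption a 720p keyframe carries roughly four times the tokens of a 360p keyframe before any pruning, which is the gap that token pruning is meant to close.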

Breakthrough Assessment

9/10

Proposes a scalable, model-agnostic evaluation standard for the rapidly growing field of computer-use agents, effectively addressing the data scarcity and compute bottlenecks of video-based reward modeling.

⚙️ Technical Details

Problem Definition

Setting: Video-based outcome assessment and error localization

Inputs: User instruction and a video-execution sequence (sequence of keyframes)

Outputs: Binary judgment of task success and temporal attribution of failure (if applicable)
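One way to picture this input/output contract is as a pair of small record types. This is a minimal sketch: the class names, the field names, and the choice to express temporal attribution as a span of keyframe indices are all assumptions made for illustration, not the paper's data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExecutionVideo:
    """Evaluator input: the user instruction plus keyframes in temporal order."""
    instruction: str
    keyframes: List[str]  # hypothetical: paths to keyframe images


@dataclass
class Judgment:
    """Evaluator output: binary success plus an optional failure span."""
    success: bool
    # Hypothetical encoding: (start, end) keyframe indices of the failure.
    failure_span: Optional[Tuple[int, int]] = None

    def __post_init__(self) -> None:
        # Temporal attribution only applies to failed trajectories.
        if self.success and self.failure_span is not None:
            raise ValueError("a successful trajectory has no failure span")


video = ExecutionVideo("Rename report.txt to final.txt", ["f0.png", "f1.png", "f2.png"])
verdict = Judgment(success=False, failure_span=(1, 2))
print(video.instruction, verdict)
```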

Pipeline Flow

Preprocessing: Step Segmentation & Keyframe Extraction
Group: Token Pruning (STP + TTP)
Vision-Language Encoding (Qwen3-VL)
Prediction Head

System Modules

Video Segmenter

Converts interaction logs into a step-level video representation

Model or implementation: Deterministic script

Spatial Token Pruner (STP) (Group: Token Pruning)

Removes spatially redundant background regions within each frame

Model or implementation: Graph-based clustering (parameter-free)
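The idea of graph-based spatial pruning can be sketched on a toy grid. The code below is not the paper's algorithm: it assumes each patch is summarized by a single integer (for example a quantized color), links 4-neighbors with equal values, and keeps one representative patch per connected component. The function name `spatial_prune` is hypothetical.

```python
from collections import deque
from typing import List, Tuple


def spatial_prune(grid: List[List[int]]) -> List[Tuple[int, int]]:
    """Keep one (row, col) patch per connected region of equal-valued patches."""
    if not grid:
        return []
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    kept: List[Tuple[int, int]] = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            # The first patch reached in raster order represents its region.
            kept.append((r, c))
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill over equal-valued neighbors
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] == grid[r][c]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return kept


# Toy frame: a uniform background (0) with two distinct UI patches (7 and 8).
frame = [
    [0, 0, 0, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]
print(spatial_prune(frame))
```

In this toy frame the twelve patches collapse to three kept tokens: one for the background and one for each distinct UI patch.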

Temporal Token Pruner (TTP) (Group: Token Pruning)

Removes tokens that do not change across frames to focus on transitions

Model or implementation: Cosine similarity comparator
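A cosine-similarity comparator of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: every frame is assumed to have the same number of token positions, each token is a plain list of floats, and the function names and the threshold of 0.95 are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; two zero vectors count as identical."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def temporal_prune(frames: List[List[Sequence[float]]],
                   threshold: float = 0.95) -> List[Tuple[int, int]]:
    """Return (frame, position) pairs for tokens that changed since the previous frame.

    `threshold` is a hypothetical value; the first frame is always kept in full.
    """
    kept: List[Tuple[int, int]] = []
    for t, frame in enumerate(frames):
        for i, vec in enumerate(frame):
            if t == 0 or cosine(frames[t - 1][i], vec) < threshold:
                kept.append((t, i))
    return kept


# Three frames, two token positions; only position 1 changes, at frame 1.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
print(temporal_prune(frames))
```

Only the first frame and the single changed token survive, so six tokens reduce to three.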

ExeVRM

Encodes instruction and pruned video tokens to predict success

Model or implementation: Qwen3-VL (8B)

Novel Architectural Elements

Spatiotemporal Token Pruning (STP+TTP) integrated into the input pipeline to enable high-resolution (720p) long-horizon video processing

Modeling

Base Model: Qwen3-VL

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

ExeVR-53k dataset: 53,000 triplets
Sources: AgentNet (22k), ScaleCUA, OSWorld (rollouts from 30 agents)
Negatives: Synthetic negatives generated via Adversarial Instruction Translation (100% pass rate in human verification of a subset)
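The construction of a hard negative from a successful trajectory can be sketched as below. The `Triplet` type, the `make_hard_negative` helper, and the toy `rewrite` function are hypothetical stand-ins; in the paper's setup a model generates the mismatched instruction, and this sketch only shows how the resulting pair is labeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Triplet:
    video_id: str
    instruction: str
    reward: int  # 1 = the video satisfies the instruction, 0 = it does not


def make_hard_negative(positive: Triplet, rewrite: Callable[[str], str]) -> Triplet:
    """Pair a successful video with a plausible but mismatched instruction."""
    if positive.reward != 1:
        raise ValueError("hard negatives are built from successful trajectories")
    new_instruction = rewrite(positive.instruction)
    if new_instruction.strip().lower() == positive.instruction.strip().lower():
        raise ValueError("rewritten instruction must differ from the original")
    # Same video, different instruction, so the correct reward flips to 0.
    return Triplet(positive.video_id, new_instruction, 0)


# Toy stand-in for the instruction-generating model.
def rewrite(text: str) -> str:
    return text.replace("bold", "italic")


pos = Triplet("traj-001", "Make the title bold", 1)
neg = make_hard_negative(pos, rewrite)
print(neg)
```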

Compute: Not reported in the paper

Comparison to Prior Work

vs. ORMs: ExeVRM uses full video context for better credit assignment
vs. PRMs: ExeVRM performs holistic video-level judgment rather than O(n) step-wise inference
vs. GUI-Pruner [cited]: ExeVRM uses simpler spatial pruning and more robust temporal strategies for offline reward modeling rather than online agent memory saving
vs. VAGEN [cited]: ExeVRM relies only on external video, independent of agent thoughts/traces

Limitations

Depends on the quality of the underlying VLM (Qwen3-VL) for visual interpretation
Adversarial translation requires manual verification to ensure high quality of negatives
Pruning thresholds (STP/TTP) may need tuning for different UI densities

Reproducibility

Code: https://github.com/limenlp/ExeVRM

Dataset (ExeVR-53k) and model (ExeVRM-8B) are available on Hugging Face. The paper describes the STP and TTP algorithms explicitly.

📊 Experiments & Results

Evaluation Setup

Reward modeling accuracy on held-out test set (ExeVR-Bench)

Benchmarks:

ExeVR-Bench (Video-based task success judgment) [New]

Metrics:

Accuracy
Recall
tIoU (Temporal Intersection over Union)
Statistical methodology: Not explicitly reported in the paper
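For reference, accuracy and recall over binary judgments can be computed as in the sketch below. Which class counts as positive for recall is not stated in this summary, so `positive` is left as a parameter; the function names and the toy labels are illustrative only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of judgments that match the ground-truth label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Fraction of ground-truth `positive` cases that the model also flags."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not relevant:
        raise ValueError("no ground-truth examples of the positive class")
    return sum(p == positive for p in relevant) / len(relevant)


# Toy labels: 1 and 0 are the two judgment classes.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```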

Key Results

ExeVRM 8B outperforms both proprietary and open-weight baselines on the ExeVR-Bench outcome assessment task. Scores are percentages; Δ is in percentage points.
Benchmark	Metric	Baseline	This Paper	Δ
ExeVR-Bench	Accuracy	80.3 (Seed-2.0 Pro)	84.7	+4.4
ExeVR-Bench	Accuracy	75.0 (GPT-5.2)	84.7	+9.7
ExeVR-Bench	Recall	74.7 (Seed-2.0 Pro)	87.7	+13.0
ExeVR-Bench	Recall	66.5 (GPT-5.2)	87.7	+21.2
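The Δ column is the difference between the ExeVRM 8B score and the baseline score, in percentage points. The short check below recomputes it from the reported numbers; the baseline names are taken from the Evaluation Highlights, and the `delta` helper is a hypothetical name.

```python
# (metric, baseline name, baseline score, ExeVRM 8B score, reported delta)
rows = [
    ("Accuracy", "Seed-2.0 Pro", 80.3, 84.7, 4.4),
    ("Accuracy", "GPT-5.2", 75.0, 84.7, 9.7),
    ("Recall", "Seed-2.0 Pro", 74.7, 87.7, 13.0),
    ("Recall", "GPT-5.2", 66.5, 87.7, 21.2),
]


def delta(baseline: float, ours: float) -> float:
    """Gap in percentage points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


for metric, name, base, ours, reported in rows:
    # Each recomputed gap should match the reported value.
    assert delta(base, ours) == reported
    print(f"{metric} vs {name}: +{delta(base, ours)}")
```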

Experiment Figures

Task composition of ExeVR-53k across different data sources (AgentNet, ScaleCUA, OSWorld)

Illustration of Adversarial Instruction Translation

Main Takeaways

Video-execution reward modeling effectively evaluates agent performance across diverse OS environments (Ubuntu, macOS, Windows, Android)
Dense video context provides critical signals that sparse screenshot-based evaluation misses
Spatiotemporal Token Pruning (STP+TTP) enables the use of high-resolution (720p) inputs, which improves recall compared to low-res (360p) inputs while maintaining tractable training costs
Adversarial instruction translation provides necessary negative supervision to learn subtle failure modes

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Computer-Use Agents / GUI Agents
Visual Token Pruning

Key Terms

CUA: Computer-Use Agents—AI systems that operate directly on real-world interfaces (desktops, browsers) to perform tasks

ExeVR-53k: The proposed dataset of 53k high-quality video–task–reward triplets derived from AgentNet, ScaleCUA, and OSWorld

STP: Spatial Token Pruning—removes visually homogeneous regions (like static backgrounds) within a single frame to save tokens

TTP: Temporal Token Pruning—suppresses tokens that remain unchanged across consecutive frames to focus on state transitions

Adversarial Instruction Translation: A data augmentation method where a model generates a mismatched instruction for a valid trajectory to create a 'hard negative' training pair

tIoU: Temporal Intersection over Union—a metric used to evaluate how well the model localizes the specific time span where an error occurred
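As a worked illustration of tIoU, the sketch below uses the standard interval definition: the length of the overlap divided by the length of the union. Representing a span as a (start, end) pair in seconds, and the function name `tiou`, are assumptions for illustration.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end) with start <= end


def tiou(pred: Span, truth: Span) -> float:
    """Temporal IoU of two spans: overlap length divided by union length."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0


print(tiou((2.0, 6.0), (4.0, 8.0)))  # partial overlap
print(tiou((0.0, 1.0), (2.0, 3.0)))  # disjoint spans
print(tiou((1.0, 5.0), (1.0, 5.0)))  # identical spans
```

A predicted span of 2 to 6 against a true span of 4 to 8 overlaps for 2 units out of a union of 6, giving a tIoU of one third; disjoint spans score 0 and identical spans score 1.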