Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction

📝 Paper Summary

GUI Agents, Reward Modeling, Reinforcement Learning with Verifiable Rewards (RLVR)

VAGEN replaces passive visual judges with an agentic verifier that actively uses tools (shell, Python, computer use) to probe the environment and verify whether a GUI task was completed successfully.

Core Problem

Existing evaluation methods for GUI agents rely either on unscalable manual scripts (rule-based) or on passive visual observation (LLM-as-a-Judge), which cannot detect latent system states hidden from screenshots.

Why it matters:

Partial state observability prevents passive judges from seeing critical non-visual evidence (e.g., file attributes, background processes), leading to inaccurate rewards
Rule-based verification is brittle and cannot scale to open-ended tasks or large-scale Reinforcement Learning (RL) training
Inaccurate reward signals hinder the optimization of GUI agents during RLVR (Reinforcement Learning with Verifiable Rewards)

Concrete Example: For the task 'Help me buy the book Reinforcement Learning', an LLM-as-a-Judge might approve the trajectory based on a 'Thank You' screen without verifying that the correct item actually appears in the backend order history. Checking that requires active clicking or probing.

Key Novelty

Verification via Agentic Environment Interaction (VAGEN)

Empowers the reward model (verifier) with the same interactive capabilities as the actor, allowing it to actively probe the environment (e.g., open files, run shell commands) rather than just looking at screenshots
Implements a 'Progressive Verification Mechanism' that attempts cheap static checks first, then visual retrospection, and finally active probing only when necessary
Utilizes a 'Read-Only Scaling' strategy for test-time compute, allowing multiple verification attempts without expensive environment resets by restricting the verifier to non-destructive actions
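
The Read-Only Scaling idea above can be sketched as repeated verification of the same terminal state. This is a minimal illustration, not the paper's procedure: the summary does not say how multiple attempts are combined, so the majority vote, the tie-breaking rule, and the names `scaled_verify` and `verify_once` are all assumptions.

```python
from collections import Counter
from typing import Callable


def scaled_verify(verify_once: Callable[[], int], attempts: int) -> int:
    """Run a read-only verifier several times and combine the binary verdicts.

    Assumption: verdicts are combined by majority vote, and a tie counts as
    failure (0). Because each attempt is read-only, no environment reset is
    modeled between attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    votes = Counter(verify_once() for _ in range(attempts))
    return 1 if votes[1] > votes[0] else 0
```

Treating ties as failure is a conservative choice for reward modeling, since a false positive reward is usually more harmful during training than a false negative.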

Architecture

The overall VAGEN framework and the Progressive Verification Mechanism flow.

Evaluation Highlights

Increases evaluation accuracy on OSWorld-Verified (Balanced) from 84.7% (LLM-as-a-Judge) to 92.9% (+8.2 percentage points)
Improves evaluation accuracy on OSWorld-Verified (Imbalanced) from 85.3% to 93.4% (+8.1 percentage points)
Demonstrates that verifier agents achieve higher success rates than actor agents, validating the 'easy to verify, hard to solve' property of GUI tasks

Breakthrough Assessment

8/10

Significant paradigm shift from passive observation to active probing for reward modeling. Addresses a critical bottleneck in GUI agent evaluation (partial observability) with substantial empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Binary reward modeling for GUI automation tasks

Inputs: User task q, Terminal state s_n, Actor trajectory T (sequence of reasoning, actions, and screenshots)

Outputs: Predicted reward R (0 or 1) and Confidence C (Low, Medium, High)
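
The inputs and outputs above can be written down as plain data types. The sketch below is illustrative only: the class and field names are hypothetical, and representing screenshots and the terminal state as strings (for example, a file path or an environment ID) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Confidence(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass
class Step:
    reasoning: str   # actor's stated reasoning for this step
    action: str      # action the actor executed
    screenshot: str  # assumed: path or ID of the screenshot


@dataclass
class VerificationInput:
    task: str               # user task q
    terminal_state: str     # assumed: handle to terminal state s_n
    trajectory: List[Step]  # actor trajectory T


@dataclass
class Verdict:
    reward: int             # predicted reward R, 0 or 1
    confidence: Confidence  # confidence C

    def __post_init__(self) -> None:
        # Enforce the binary reward setting.
        if self.reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
```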

Pipeline Flow

Trajectory Memory Consolidation (Summarize actor actions)
Progressive Verification (Static -> Retro -> Proactive)
Read-Only Scaling (Multiple verifications if needed)
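
The Static -> Retro -> Proactive ordering can be sketched as a loop that stops at the first conclusive stage. This is a simplified illustration under stated assumptions: each stage is modeled as a hypothetical callable that returns 0 or 1 when it is conclusive and `None` otherwise, and falling back to reward 0 when no stage is conclusive is an assumption, not something the summary specifies.

```python
from typing import Callable, List, Optional, Tuple

# A stage returns 0 or 1 when it reaches a verdict, or None when the
# evidence it can see is insufficient.
Stage = Callable[[], Optional[int]]


def progressive_verify(stages: List[Tuple[str, Stage]]) -> Tuple[int, str]:
    """Try stages from cheapest to most expensive; stop at the first verdict.

    Returns the reward and the name of the stage that produced it.
    """
    for name, stage in stages:
        reward = stage()
        if reward is not None:
            return reward, name
    # Assumption: if nothing is conclusive, treat the task as not verified.
    return 0, "unresolved"
```

Called with stages ordered as static check, visual retrospection, and active probing, the later (more expensive) callables are never invoked once an earlier one returns a verdict.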

System Modules

Trajectory Summarizer

Condense the actor's trajectory into a concise operation history, discarding subjective reasoning (sub-goal analysis) to focus on objective actions

Model or implementation: LLM (Model π_s)
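
In the paper this consolidation is performed by an LLM. The sketch below is a rule-based stand-in that only illustrates the intended effect, keeping objective actions and dropping the reasoning text. The function name `consolidate` and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def consolidate(trajectory: List[Dict[str, str]]) -> str:
    """Build a numbered operation history from the actor's steps.

    Only the 'action' field of each step is kept; the 'reasoning' field is
    ignored. (A rule-based stand-in for the LLM summarizer.)
    """
    lines = []
    for index, step in enumerate(trajectory, start=1):
        lines.append(f"{index}. {step['action']}")
    return "\n".join(lines)
```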

Verifier Agent

Determine if the task was successfully completed by observing the summary and actively probing the environment

Model or implementation: Claude-Sonnet-4.5 (Example instance of π_e)

Novel Architectural Elements

Progressive Verification Mechanism: A staged control flow where the agent only uses expensive active probing tools if static visual evidence is insufficient
Active Tool Use for Reward Modeling: Integration of 'Execute Shell' and 'Computer Use' specifically for verification purposes

Modeling

Base Model: Claude-Sonnet-4.5 (used as the Verifier Agent)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLM-as-a-Judge: VAGEN actively interacts with the environment to uncover latent states, whereas LLM-as-a-Judge is limited to passive visual observation
vs. Rule-based methods: VAGEN is scalable and handles open-ended tasks without requiring manual script engineering for every new task
vs. DigiRL/WebRL: VAGEN operates online with active probing capabilities rather than filtering static offline datasets (a conceptual comparison; the paper does not use these as direct baselines)

Limitations

Dependency on a strong underlying agent model (e.g., Claude-Sonnet-4.5) for the verifier
Active verification increases inference latency and cost compared to passive visual judging
Read-only scaling requires careful definition of read-only actions to prevent environment corruption
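
The last limitation, defining which actions count as read-only, can be illustrated with a conservative allow-list for shell commands. This is a hypothetical sketch, not the paper's definition: the program list and the rejected characters are assumptions, and a filter like this is not a security boundary on its own.

```python
import shlex

# Hypothetical allow-list of programs treated as read-only.
READ_ONLY_PROGRAMS = {"ls", "cat", "stat", "head", "tail", "grep", "file", "pwd"}

# Characters that enable redirection, chaining, or substitution in a shell.
FORBIDDEN_CHARS = set("><|;&`$\n")


def is_read_only(command: str) -> bool:
    """Return True only for a single allow-listed command with no shell operators."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected outright.
        return False
    return bool(tokens) and tokens[0] in READ_ONLY_PROGRAMS
```

Rejecting anything not explicitly allowed is stricter than necessary, but it errs on the side of leaving the environment untouched between verification attempts.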

📊 Experiments & Results

Evaluation Setup

Verify the correctness of GUI agent trajectories on operating system tasks

Benchmarks:

OSWorld-Verified (Desktop GUI automation, Windows/Linux)
AndroidWorld (Mobile GUI automation)

Metrics:

Evaluation Accuracy (Agreement with ground truth)
Actor Success Rate (via Rejection Sampling)
Statistical methodology: Not explicitly reported in the paper
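
Evaluation accuracy, read here as the percentage of trajectories on which the verifier's binary verdict agrees with the ground-truth label, can be computed as in the sketch below. The function name and the choice to return a percentage are illustrative.

```python
from typing import Sequence


def evaluation_accuracy(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Percentage of items where the predicted reward equals the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one trajectory is required")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * matches / len(predicted)
```

For example, verdicts `[1, 0, 1, 1]` against labels `[1, 0, 0, 1]` agree on three of four trajectories, giving 75.0.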

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
OSWorld-Verified (Balanced)	Evaluation Accuracy	84.7	92.9	+8.2
OSWorld-Verified (Imbalanced)	Evaluation Accuracy	85.3	93.4	+8.1

Experiment Figures

Illustration of the Test-Time Scaling strategy (Best-of-N) using VAGEN.

Main Takeaways

VAGEN significantly outperforms passive LLM-as-a-Judge approaches by leveraging active interaction to resolve partial observability.
The 'easy to verify, hard to solve' property holds for GUI agents; verifiers consistently achieve higher success rates than actors.
Test-time scaling (Best-of-N) guided by VAGEN further improves actor performance; the size of the actor improvement is not reported in this summary.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
GUI Agents (ReAct paradigm)
LLM-as-a-Judge evaluation metrics

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards, which trains agents via reinforcement learning using binary success signals

GUI Agent: An AI agent that interacts with a graphical user interface (e.g., clicking buttons, typing) to perform tasks

Partial State Observability: The condition where an agent (or verifier) cannot see the full state of the system (e.g., hidden files, memory) through visual screenshots alone

Agentic Interactive Verification: A paradigm where the evaluator is an agent capable of executing actions to verify task completion, rather than just a passive observer

Latent State: System properties not visible on the screen, such as file permissions, background processes, or file content not currently open
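
As a small illustration of probing a latent state, the sketch below checks a file's permission bits, which a screenshot would not normally reveal. The scenario (a task asking for a file to be made read-only) and the function name are hypothetical; only standard-library calls are used.

```python
import os
import stat


def is_read_only_file(path: str) -> bool:
    """Return True if no write permission bit is set for user, group, or others."""
    mode = os.stat(path).st_mode
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return (mode & write_bits) == 0
```

A verifier with shell or Python access could run a check like this directly, whereas a passive judge would have to infer the same fact from whatever happens to be on screen.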

Rejection Sampling: A technique in which multiple candidate solutions are generated and a verifier selects which one to submit, rejecting the rest

Best-of-N: An inference strategy where N trajectories are generated, and the one with the highest reward model score is selected
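
Best-of-N selection with a binary reward and a confidence level can be sketched as below. The ranking rule is an assumption made for illustration: successes are preferred over failures, a success with higher confidence ranks higher, and a failure with higher confidence ranks lower. The `Candidate` type and its fields are hypothetical.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical numeric ranks for the three confidence levels.
CONFIDENCE_RANK = {"Low": 0, "Medium": 1, "High": 2}


class Candidate(NamedTuple):
    trajectory_id: str
    reward: int       # verifier's predicted reward, 0 or 1
    confidence: str   # "Low", "Medium", or "High"


def selection_key(candidate: Candidate) -> Tuple[int, int]:
    """Order by reward first, then by confidence (inverted for failures)."""
    rank = CONFIDENCE_RANK[candidate.confidence]
    return (candidate.reward, rank if candidate.reward == 1 else -rank)


def best_of_n(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate the verifier scores highest; ties keep the earliest."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=selection_key)
```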