InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

📝 Paper Summary

Multimodal GUI Agents, Vision-Language Navigation

InfiGUIAgent is a multimodal agent that learns native hierarchical planning and self-reflection skills through a two-stage fine-tuning process, reducing reliance on external accessibility trees.

Core Problem

Existing MLLM-based GUI agents struggle with multi-step reasoning (leading to repetitive errors) and rely heavily on text-based accessibility trees, which lose visual information and introduce computational overhead.

Why it matters:

Reliance on textual representations (accessibility trees/Set-of-Marks) causes information loss and varies by platform, hindering deployment
Lack of native reflection capabilities leads agents to repeat the same mistakes without self-correction during complex tasks
Current agents lack robust fundamental skills in understanding high-resolution, clutter-heavy mobile and computer interfaces

Concrete Example: Many agents rely on 'Set-of-Marks' (overlaying IDs on the screen) to interact. InfiGUIAgent instead processes the raw pixels directly to perform actions like clicking specific coordinates.

Key Novelty

Two-Stage SFT with Synthesized Native Reasoning

Stage 1 SFT: Enhances fundamental grounding and understanding using standardized datasets (Screen2Words, Rico) with coordinate normalization.
Stage 2 SFT: Instills 'native' reasoning by fine-tuning on trajectories synthesized by a teacher model (Qwen2-VL-72B) that include explicit hierarchical planning and expectation-reflection loops.
Reflective Reasoning: The model is trained to generate an 'expectation' before an action and a 'reflection' after observing the result, enabling self-correction.

Architecture

The two-stage supervised fine-tuning pipeline. Stage 1 focuses on Fundamental Abilities (Understanding, Grounding) using collected datasets. Stage 2 focuses on Native Reasoning (Hierarchical, Expectation-Reflection) using synthesized data from trajectories.

Evaluation Highlights

Quantitative results are not contained in the provided text (the text provided ends at Section 3.2, before the Experiments section).

Breakthrough Assessment

7/10

Proposes a methodology for internalizing reasoning through fine-tuning (rather than prompting) and for removing the dependency on accessibility trees, which is a notable shift in how GUI agents are built.

⚙️ Technical Details

Problem Definition

Setting: Agent interaction with a mobile environment

Inputs: Goal g, Current observation o_t (screenshot), Historical context H_t

Outputs: Reasoning process r_t and Action a_t
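
Using the symbols above, one agent step can be written compactly as sampling from a learned policy. The policy symbol \pi_\theta is an assumption introduced here for illustration; the source names only g, o_t, H_t, r_t and a_t.

```latex
(r_t, a_t) \sim \pi_\theta\left(\,\cdot \mid g,\; o_t,\; H_t\right)
```

Read this way, the reasoning r_t is part of the model's output rather than a product of an external prompting scaffold, which is what the summary means by 'native' reasoning.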

Pipeline Flow

Input Processing (Goal + Screenshot + History)
Strategic Layer (Summary + Planning)
Tactical Layer (Action Selection + Grounding)
Action Execution (Function Call)
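
The four stages above can be sketched as a single step function. This is a minimal illustration, not the authors' implementation: in the described system both layers are produced in one generation stream, whereas here they are separate callables so the data flow is visible. All class names, field names, and the function-call layout (`name`, `arguments`, `x`, `y`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StepInput:
    goal: str                      # g: the task instruction
    screenshot: bytes              # o_t: raw pixels of the current screen
    history: List[str] = field(default_factory=list)  # H_t: earlier steps


@dataclass
class StepOutput:
    summary: str                   # strategic layer: what has happened so far
    plan: str                      # strategic layer: next sub-goal
    action_name: str               # tactical layer: chosen GUI operation
    point: Tuple[int, int]         # tactical layer: grounded target coordinates

    def to_function_call(self) -> Dict:
        # Hypothetical function-call layout, for illustration only.
        return {
            "name": self.action_name,
            "arguments": {"x": self.point[0], "y": self.point[1]},
        }


def run_step(
    inp: StepInput,
    strategic: Callable[[StepInput], Tuple[str, str]],
    tactical: Callable[[StepInput, str], Tuple[str, Tuple[int, int]]],
) -> StepOutput:
    # Strategic layer: summarize progress and pick the next sub-goal.
    summary, plan = strategic(inp)
    # Tactical layer: choose a concrete operation and ground it on screen.
    action_name, point = tactical(inp, plan)
    return StepOutput(summary, plan, action_name, point)
```

Calling `run_step` with two stub callables and then `to_function_call()` on the result yields the dictionary that would be handed to the execution stage.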

System Modules

Input Encoder

Encodes the goal, history, and current screenshot observation

Model or implementation: Not explicitly named in provided text (likely same as base model)

Strategic Reasoner (Reasoning)

Analyzes the overall objective and determines the sequence of sub-goals

Model or implementation: Learned weights (via SFT)

Tactical Reasoner (Reasoning)

Selects concrete GUI operations and grounds them to coordinates

Model or implementation: Learned weights (via SFT)

Action Generator

Outputs the final function call

Model or implementation: Learned weights (via SFT)

Novel Architectural Elements

Integration of Hierarchical (Strategic/Tactical) reasoning layers directly into the generation stream via SFT
Expectation-Reflection loop embedded in the inference process (generating expectations for next-step verification)
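
The expectation-reflection loop can be illustrated with a small control loop. This is a sketch under stated assumptions: `propose`, `execute`, and `reflect` are hypothetical callables standing in for the model and the environment, and the `"done"` sentinel and history record layout are invented for the example.

```python
from typing import Callable, Dict, List, Tuple


def run_episode(
    propose: Callable[[str, List[Dict]], Tuple[str, str]],
    execute: Callable[[str], str],
    reflect: Callable[[str, str], str],
    first_observation: str,
    max_steps: int = 5,
) -> List[Dict]:
    """Run a toy expectation-reflection loop.

    propose(obs, history) -> (action, expectation)
    execute(action)       -> new observation
    reflect(expectation, new_obs) -> reflection text
    """
    history: List[Dict] = []
    obs = first_observation
    for _ in range(max_steps):
        # Before acting, the agent states what it expects to see next.
        action, expectation = propose(obs, history)
        if action == "done":  # hypothetical stop signal
            break
        new_obs = execute(action)
        # After acting, the agent compares the result with its expectation.
        reflection = reflect(expectation, new_obs)
        history.append(
            {"action": action, "expectation": expectation, "reflection": reflection}
        )
        obs = new_obs
    return history
```

Because each reflection is stored in the history that `propose` receives, a mismatch recorded at one step is available to change the action chosen at the next step, which is the self-correction behavior described above.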

Modeling

Base Model: Not explicitly named in provided text (Qwen2-VL-72B is used as the 'teacher' for data synthesis, but the student model is not specified)

Training Method: Two-stage Supervised Fine-Tuning (SFT)

Training Data:

Stage 1: Existing datasets (Screen2Words, Rico, GUIEnv, etc.) standardized to [0, 1000] coordinate scale
Stage 2: Synthesized reasoning data using Qwen2-VL-72B on existing trajectories (generating description, reflection, strategy, tactics, expectation)
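
One way to picture a Stage 2 training sample is as a record holding the five synthesized fields named above, rendered into a text target alongside the action. The field names follow the summary; the `field: value` rendering, the class name, and the `action` line are assumptions for illustration and may differ from the format the authors use.

```python
from dataclasses import dataclass

# Order in which the summary lists the synthesized fields.
FIELD_ORDER = ("description", "reflection", "strategy", "tactics", "expectation")


@dataclass
class SynthesizedStep:
    description: str   # what the current screen shows
    reflection: str    # whether the previous expectation was met
    strategy: str      # high-level plan (strategic layer)
    tactics: str       # concrete next operation (tactical layer)
    expectation: str   # predicted outcome of the action


def render_target(step: SynthesizedStep, action_call: str) -> str:
    """Render one step as a plain-text training target (hypothetical layout)."""
    lines = [f"{name}: {getattr(step, name)}" for name in FIELD_ORDER]
    lines.append(f"action: {action_call}")
    return "\n".join(lines)
```

Keeping the fields in a typed record makes it straightforward to check that every synthesized step contains all five reasoning components before it enters the training set.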

Key Hyperparameters:

coordinate_scale: [0, 1000]
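
Mapping pixel coordinates onto a [0, 1000] scale makes annotations independent of screen resolution. The following is a minimal sketch of such a mapping; the function names, the rounding to integers, and the clamping are assumptions, since the summary specifies only the target range.

```python
from typing import Tuple


def normalize_point(
    x_px: float, y_px: float, width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map a pixel coordinate to the [0, scale] range (rounding is an assumption)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    nx = round(x_px / width * scale)
    ny = round(y_px / height * scale)
    # Clamp so points on or slightly past the edge stay inside the range.
    return (min(max(nx, 0), scale), min(max(ny, 0), scale))


def denormalize_point(
    nx: int, ny: int, width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map a normalized coordinate back to pixels for a given screen size."""
    return (round(nx / scale * width), round(ny / scale * height))
```

For example, the center of a 1080 x 2400 screenshot, pixel (540, 1200), maps to (500, 500), and the same normalized point maps back to (540, 1200) on that screen.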

Compute: Not reported in the provided text

Comparison to Prior Work

vs. AppAgent/CogAgent: InfiGUIAgent focuses on 'native' reasoning (internalizing the planning/reflection steps via SFT) rather than just perception or prompting.
vs. ILuvUI: Uses a two-stage pipeline specifically adding a reasoning synthesis stage after fundamental understanding.
vs. General Agents: Removes reliance on Accessibility Trees or Set-of-Marks, processing visual inputs directly with reference-augmented annotations.

Limitations

The method relies on synthesized data from a larger model (Qwen2-VL-72B), which may propagate hallucinations or errors from the teacher.
Expectation generation deliberately ignores the actual next state to prevent leakage, which might limit the precision of state transition modeling during training.
Requires extensive preprocessing and format standardization across diverse datasets.

Reproducibility

Code: https://github.com/Reallm-Labs/InfiGUIAgent

The text describes extensive data preprocessing (coordinate normalization, instruction enhancement) and synthesis using Qwen2-VL-72B.

📊 Experiments & Results

Evaluation Setup

Not reported in the provided text (Section 3.2 ends the document)

Metrics: Not reported in the provided text

Statistical methodology: Not reported in the provided text

Main Takeaways

The paper proposes a shift from text-augmented GUI agents (using accessibility trees) to visually-native agents.
Reasoning capabilities (planning, reflection) are treated as learnable skills via SFT on synthesized thoughts, rather than emergent properties of prompting.
Data quality is prioritized through coordinate standardization and 'Reference-Augmented Annotation' to handle the unique spatial nature of GUIs.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Supervised Fine-Tuning (SFT)
Reinforcement Learning terminology (State, Action, Observation)

Key Terms

GUI: Graphical User Interface—visual interface on computers/phones

MLLM: Multimodal Large Language Model—AI that processes both text and images

SFT: Supervised Fine-Tuning—training a model on labeled examples

Accessibility Tree: A text-based structural representation of a UI (e.g., derived from the HTML DOM or the Android view hierarchy) that is exposed to assistive tools such as screen readers

Set-of-Marks: A technique where numerical IDs are overlaid on image elements to help models reference them

Hierarchical Reasoning: Breaking a task into high-level strategy (planning) and low-level tactics (execution)

Expectation-Reflection: A reasoning pattern where the agent predicts an outcome before acting and evaluates the result afterward

Grounding: Linking textual concepts (e.g., 'Submit button') to specific visual coordinates

Trajectory: A sequence of states, actions, and observations recorded during a task interaction