Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

📝 Paper Summary

Visual Document Retrieval-Augmented Generation (VD-RAG), Multimodal Reasoning, Visual Evidence Attribution

Chain-of-Evidence (CoE) extends Chain-of-Thought so that each intermediate reasoning step is explicitly grounded in a specific document region; the Look-As-You-Think (LAT) framework trains vision-language models to produce such chains with reinforcement learning.

Core Problem

Current Vision-Language Models (VLMs) in RAG scenarios lack traceable reasoning and often hallucinate, directly jumping to answers without showing the progressive search process used by humans.

Why it matters:

Without reliable attribution, users cannot verify if the model's answer is based on actual document content or hallucination
Existing methods link answers to evidence only at the end, failing to reveal the intermediate reasoning path (traceability) crucial for complex multi-step queries
Training data for step-by-step visual attribution is scarce and expensive to annotate manually

Concrete Example: In a multi-page document query, a standard VLM might correctly answer '25%' but fail to show which table or paragraph it came from. Or, it might cite the wrong chart entirely. Humans solve this by first finding the chapter, then the section, then the specific table—a process standard VLMs do not replicate.

Key Novelty

Look-As-You-Think (LAT) Reinforcement Learning Framework

Formalizes 'Chain-of-Evidence' (CoE) where reasoning steps are explicitly linked to bounding boxes and page indices (coarse-to-fine grounding)
Uses a reinforcement learning approach (LAT) with a 'stepwise attribution reward' that checks if the image region inside a predicted bounding box semantically matches the reasoning text
Optimizes for both answer accuracy and the validity of the evidence trail, encouraging the model to 'look' at the right place while 'thinking'

Architecture

The LAT training pipeline, showing Stage I (Cold-Start SFT) and Stage II (Reinforcement Learning with GRPO).

Evaluation Highlights

Achieves +8.23% improvement in soft Exact Match (EM) over vanilla Qwen2.5-VL-7B-Instruct on VISA benchmarks
Improves Intersection over Union (IoU@0.5) by 47.0%, significantly boosting the precision of visual evidence localization
Outperforms supervised fine-tuning baselines, demonstrating that RL generalizes better than simple imitation of grounded reasoning traces

Breakthrough Assessment

8/10

Significantly advances explainable AI in multimodal RAG by enforcing verifiable intermediate steps via RL, addressing the critical 'black box' reasoning problem in VLMs.

⚙️ Technical Details

Problem Definition

Setting: Visual Document RAG where a model must answer a query q given document pages P, producing answer a and evidence bounding boxes

Inputs: Textual query q and a set of pre-retrieved document pages P

Outputs: Final answer a, supporting page index i*, bounding box B_ans, and a reasoning chain where steps are grounded with (i_t, B_t)
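As a rough illustration of the output structure described above, the following sketch models a grounded reasoning chain as plain Python data classes. The class and field names (`EvidenceStep`, `CoEOutput`, `pages_cited`) and the pixel-coordinate box convention are assumptions made for this example; the summary does not specify the paper's actual serialization.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class EvidenceStep:
    """One reasoning step grounded in a page region (hypothetical type)."""
    text: str        # reasoning text r_t
    page_index: int  # page index i_t
    bbox: BBox       # bounding box B_t


@dataclass
class CoEOutput:
    """Final answer plus its grounded reasoning chain (hypothetical type)."""
    answer: str             # final answer a
    answer_page_index: int  # supporting page index i*
    answer_bbox: BBox       # final evidence box B_ans
    steps: List[EvidenceStep] = field(default_factory=list)


def pages_cited(output: CoEOutput) -> List[int]:
    """Return the sorted, de-duplicated page indices referenced by the output."""
    pages = {step.page_index for step in output.steps}
    pages.add(output.answer_page_index)
    return sorted(pages)


example = CoEOutput(
    answer="25%",
    answer_page_index=2,
    answer_bbox=(10, 20, 110, 60),
    steps=[
        EvidenceStep("Find the results section.", 1, (0, 0, 500, 80)),
        EvidenceStep("Locate the table with the percentage.", 2, (10, 20, 300, 200)),
    ],
)
```

Keeping the page index next to each box makes the coarse-to-fine order (page first, then region) explicit for every step.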

Pipeline Flow

Input Processing (Query + Document Pages)
CoE Generation (VLM generates reasoning steps + bounding boxes)
Outcome & Process Evaluation (Reward computation)
Policy Update (GRPO)
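The last step of the flow, the GRPO policy update, scores each sampled output relative to the other outputs generated for the same query instead of using a learned critic. A minimal sketch of that group-relative normalization follows; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, and the paper's exact settings are not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same query.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    that beat their siblings get positive values and the rest get negative ones.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one query: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.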

System Modules

Policy Model (VLM)

Generates the reasoning chain (text + bounding boxes) and final answer

Model or implementation: Qwen2.5-VL-7B-Instruct

Reward Computer

Calculates rewards based on answer accuracy, formatting, and semantic alignment of evidence

Model or implementation: ColQwen2 (for semantic similarity)

Novel Architectural Elements

Chain-of-Evidence (CoE) output structure: interleaved text reasoning and visual bounding box tokens representing intermediate evidence
Stepwise Attribution Reward mechanism: uses a frozen retriever (ColQwen2) to verify if the 'looked at' region matches the 'thought about' text dynamically during RL
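To make the stepwise attribution idea concrete, the sketch below scores each step by the cosine similarity between an embedding of its reasoning text and an embedding of its cropped region, then averages over steps. The callables `embed_text` and `embed_region` are hypothetical stand-ins for a frozen retriever; they do not reproduce the real ColQwen2 API, and averaging over steps is an assumption of this example rather than something the summary states.

```python
import math
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]
Step = Tuple[str, int, BBox]  # (reasoning text r_t, page index i_t, box B_t)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def stepwise_attribution_reward(
    steps: List[Step],
    embed_text: Callable[[str], Sequence[float]],           # hypothetical encoder
    embed_region: Callable[[int, BBox], Sequence[float]],   # hypothetical crop + encode
) -> float:
    """Average text-to-region similarity over all reasoning steps."""
    if not steps:
        return 0.0
    scores = [
        cosine(embed_text(text), embed_region(page, box))
        for text, page, box in steps
    ]
    return sum(scores) / len(scores)
```

Passing the encoders in as arguments keeps the reward logic testable with toy embeddings and leaves the choice of retriever outside the function.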

Modeling

Base Model: Qwen2.5-VL-7B-Instruct

Training Method: Reinforcement Learning (GRPO) following SFT Cold-Start

Objective Functions:

Purpose: Ensure answer correctness.
Formally: R_acc is based on soft exact match (EM) and recall overlap.

Purpose: Verify intermediate reasoning.
Formally: R_step measures cosine similarity between reasoning text r_t and the image region cropped by B_t, using ColQwen2.

Purpose: Ensure final evidence accuracy.
Formally: R_ground measures IoU between the predicted final evidence box and the ground truth.

Purpose: Enforce structured output.
Formally: R_format penalizes outputs missing <think> or <answer> tags.
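The format check and the way the four reward terms might be combined can be sketched as follows. This is an illustration under stated assumptions: the summary does not say how the terms are weighted or combined, so the additive form and the default unit weights in `total_reward` are placeholders, and the 0/1 scoring in `format_reward` is one simple reading of "penalizes".

```python
import re

# Match a complete tag pair, allowing newlines inside the tags.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>.*?</answer>", re.DOTALL)


def format_reward(output: str) -> float:
    """Return 1.0 if both a <think> and an <answer> block are present, else 0.0."""
    has_think = THINK_RE.search(output) is not None
    has_answer = ANSWER_RE.search(output) is not None
    return 1.0 if (has_think and has_answer) else 0.0


def total_reward(
    r_acc: float,
    r_step: float,
    r_ground: float,
    r_format: float,
    weights=(1.0, 1.0, 1.0, 1.0),  # assumed weights, not taken from the paper
) -> float:
    """Weighted sum of the four reward terms (assumed combination rule)."""
    w_acc, w_step, w_ground, w_format = weights
    return (
        w_acc * r_acc
        + w_step * r_step
        + w_ground * r_ground
        + w_format * r_format
    )
```

For example, `format_reward("<answer>25%</answer>")` returns 0.0 because the reasoning block is missing.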

Adaptation: LoRA (Low-Rank Adaptation)

Training Data:

Cold-start data: 1,000 instances sampled and annotated by Gemini 2.5 Pro
Manual verification retains ~30% of samples where bounding boxes are correct

Key Hyperparameters:

sampling_rate: 5% of training data for RL
annotation_model: Gemini 2.5 Pro (for cold start)

Compute: Not reported in the paper

Comparison to Prior Work

vs. VISA: VISA only grounds the final answer; LAT grounds the entire reasoning chain (CoE) step-by-step
vs. DeepSeek-R1: Extends the R1-style RL paradigm to the multimodal (visual document) domain with dedicated visual attribution rewards
vs. Standard CoT: Adds explicit bounding box constraints to reasoning steps, forcing the model to 'look' while thinking

Limitations

Relies on a proprietary model (Gemini 2.5 Pro) for generating high-quality cold-start training data
Computational cost of RL with visual reward computation (cropping and encoding images for every step) is likely high, though not explicitly quantified
Requires ground truth answers for the outcome reward, limiting applicability to open-ended tasks without known answers

Reproducibility

The paper text does not state whether code is available. The method relies on Gemini 2.5 Pro, a proprietary model, for data synthesis. The rewards use ColQwen2, which has open weights.

📊 Experiments & Results

Evaluation Setup

Visual Document Question Answering with evidence localization

Benchmarks:

Wiki-VISA (Wikipedia-based visual QA)
Paper-VISA (Academic paper layout QA)
FineWeb-VISA (Web passage QA)

Metrics:

Soft Exact Match (EM)
Intersection over Union (IoU@0.5)
Recall
Statistical methodology: Not explicitly reported in the paper
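For reference, the IoU@0.5 metric listed above can be computed as the fraction of predicted boxes whose overlap with the ground-truth box reaches the 0.5 threshold. The sketch below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and counts an IoU of exactly 0.5 as a hit; both choices are assumptions, since the summary does not describe the benchmark's evaluation script.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_at_threshold(preds: List[BBox], golds: List[BBox], threshold: float = 0.5) -> float:
    """Fraction of prediction/ground-truth pairs with IoU >= threshold."""
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, golds) if iou(p, g) >= threshold)
    return hits / len(preds)
```

Two 10x10 boxes offset by half their width overlap in 50 of 150 total units of area, so their IoU is 1/3 and the pair counts as a miss at the 0.5 threshold.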

Key Results

LAT consistently improves over the vanilla model and SFT baseline across single-image and multi-image settings.

Benchmark	Metric	Baseline	This Paper	Δ
Average (Paper-VISA & Wiki-VISA)	Soft EM	Not reported (percentage gain only)	Not reported (percentage gain only)	+8.23%
Average (Paper-VISA & Wiki-VISA)	IoU@0.5	Not reported (percentage gain only)	Not reported (percentage gain only)	+47.0%

Experiment Figures

Comparison of human reasoning vs. VLM reasoning (VISA baseline vs. Proposed CoE).

Main Takeaways

RL (LAT) outperforms Supervised Fine-Tuning (SFT) alone, suggesting that the model learns to reason and ground better through exploration than just imitation.
The method is effective in both single-image and multi-image scenarios, addressing the challenge of finding evidence across multiple document pages.
The stepwise reward mechanism successfully encourages the model to produce verifiable reasoning traces rather than just accurate final answers.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Vision-Language Models (VLMs)
Retrieval-Augmented Generation (RAG)
Chain-of-Thought (CoT) prompting

Key Terms

VD-RAG: Visual Document Retrieval-Augmented Generation—answering questions based on visual document content (PDFs, slides) rather than just plain text

CoE: Chain-of-Evidence—a proposed reasoning paradigm where each textual reasoning step is explicitly linked to a visual region (bounding box) in the document

LAT: Look-As-You-Think—the proposed reinforcement learning framework that trains models to generate CoE traces by rewarding correct answers and semantically aligned visual grounding

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies by comparing a group of outputs for the same input, removing the need for a critic model

IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box

SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) before applying reinforcement learning

ColQwen2: A multimodal retriever model used here to calculate the semantic similarity between a cropped image region and the reasoning text for the reward signal