Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

📝 Paper Summary

Medical Vision-Language Models (Med-VLMs), Gaze-guided machine learning

The paper trains medical VLMs to follow the sequential visual search patterns of radiologists by using ordered eye-gaze data to supervise dedicated latent gaze tokens.

Core Problem

Standard VLMs process images as tokens but perform reasoning primarily in text space, ignoring the structured, sequential visual search process radiologists use to gather evidence.

Why it matters:

Radiology diagnosis is inherently visual and sequential; converting visual signals immediately to text loses critical temporal evidence-gathering cues
Existing methods treat gaze merely as a static spatial attention map (heatmap), discarding the 'reasoning' encoded in the order of fixation
Purely text-based reasoning in VLMs can lead to hallucinations or failures in grounding findings to specific image regions

Concrete Example: A standard VLM might correctly identify 'edema' but fail to localize it or understand which regions confirmed the diagnosis. By contrast, this method forces the model to 'look' at the hilar regions first, then the lung bases, mirroring a radiologist's scanpath.

Key Novelty

Sequential Gaze-Supervised Tokens

Introduces dedicated 'gaze tokens' into the VLM's vocabulary that act as placeholders for intermediate visual evidence
Supervises these tokens to predict image patch indices in the exact temporal order of a radiologist's eye-tracking scanpath
Forces the model to internalize a human-like 'where to look next' strategy rather than just learning a static heatmap of important regions

Architecture

The model architecture showing the Qwen2.5-VL backbone, the insertion of special <st> tokens, and the two branches of supervision: patch index prediction (gaze) and text generation/classification.

Evaluation Highlights

Achieves state-of-the-art in-domain performance on MIMIC-EYE (90.17 AUROC), surpassing both random (86.45) and shuffled (88.51) gaze baselines
Demonstrates strong zero-shot robustness on external benchmarks, e.g., +5.09 F1 on RSNA compared to instruction-tuned baselines
Original-ordered gaze consistently outperforms shuffled gaze, confirming that the temporal sequence of visual search encodes valuable reasoning signals

Breakthrough Assessment

8/10

Novel integration of temporal gaze sequences as direct token supervision rather than auxiliary loss. Strong empirical gains and improved interpretability in medical imaging.

⚙️ Technical Details

Problem Definition

Setting: Chest X-ray report generation and multi-label classification using gaze-augmented supervision

Inputs: Chest X-ray image I, instruction prompt

Outputs: Structured text report and 14-label classification vector y

Pipeline Flow

Visual Encoding (Image to Visual Tokens)
Prompt Construction (Image tokens + Instruction)
VLM Decoding (Generates Gaze Tokens <st> then Text)
Gaze Supervision (Projects <st> to Patch Indices)
Classification Head (Predicts 14 labels)
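The decoding step in the flow above can be pictured with a small sketch. It assumes the response is a flat list of string tokens and that the four `<st>` tokens sit at the very start of the response; the helper names `build_response_tokens` and `gaze_positions` are hypothetical and not taken from the paper.

```python
GAZE_TOKEN = "<st>"  # special token named in the summary


def build_response_tokens(report_tokens, n_gaze_tokens=4):
    """Prefix the report with gaze tokens (hypothetical helper).

    The count of 4 follows the summary; the flat token list is an
    illustrative simplification of a real tokenizer's output.
    """
    return [GAZE_TOKEN] * n_gaze_tokens + list(report_tokens)


def gaze_positions(tokens):
    """Return the positions whose hidden states would feed the gaze head."""
    return [i for i, tok in enumerate(tokens) if tok == GAZE_TOKEN]


tokens = build_response_tokens(["Findings", ":", "edema"])
positions = gaze_positions(tokens)
```

Here `positions` picks out the four leading slots, so a training loop could read the decoder's hidden states at exactly those indices and pass them to the gaze projection head.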

System Modules

Vision Encoder

Encodes chest X-ray into sequence of visual tokens

Model or implementation: Qwen2.5-VL-7B-Instruct (Vision part)

Gaze Projection Head

Maps hidden states of gaze tokens to probabilities over image patches

Model or implementation: Linear projection
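As a rough illustration of the gaze projection head, the sketch below applies a linear map followed by a softmax over patches. It is written in plain Python for readability; the function names, the bias term, and the toy dimensions are assumptions, since the summary only states that the head is a linear projection.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_head(hidden, weight, bias):
    """Map one gaze-token hidden state to a distribution over patches.

    hidden: list of d floats (hidden state of one <st> token)
    weight: n_patches rows, each a list of d floats (assumed shape)
    bias:   list of n_patches floats (assumed; the paper may omit it)
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy example: 2-dimensional hidden state, 3 image patches.
probs = gaze_head([1.0, 0.0],
                  [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                  [0.0, 0.0, 0.0])
```

In the toy call, the first patch receives the largest probability because its weight row aligns with the hidden state.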

Language Model

Generates structured report and gaze tokens autoregressively

Model or implementation: Qwen2.5-VL-7B-Instruct (LLM part)

Classification Head

Predicts presence of 14 findings

Model or implementation: Linear classifier

Novel Architectural Elements

Insertion of exactly four dedicated 'gaze tokens' (<st>) at the start of response generation to serve as evidence carriers
Direct supervision of these latent tokens to predict discrete image patch indices corresponding to temporal gaze fixation clusters
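One way to picture how an ordered scanpath could become per-token patch targets is sketched below. The summary does not specify how fixation clusters are formed, so this sketch makes three loud assumptions: the image is divided into a square patch grid, the scanpath is split evenly in time into one segment per gaze token, and the top-k patches in each segment are ranked by total fixation duration. All function names are hypothetical.

```python
from collections import Counter


def fixation_to_patch(x, y, img_w, img_h, grid):
    """Map a fixation at pixel (x, y) to a row-major patch index."""
    col = min(int(x / img_w * grid), grid - 1)
    row = min(int(y / img_h * grid), grid - 1)
    return row * grid + col


def scanpath_to_targets(fixations, img_w, img_h, grid=4, n_tokens=4, k=2):
    """Turn temporally ordered fixations into targets for each gaze token.

    fixations: list of (x, y, duration) tuples in viewing order.
    Returns n_tokens lists, each holding up to k patch indices.
    """
    targets = []
    n = len(fixations)
    for t in range(n_tokens):
        # Even temporal split: an assumption, not the paper's clustering.
        segment = fixations[t * n // n_tokens:(t + 1) * n // n_tokens]
        dwell = Counter()
        for x, y, duration in segment:
            dwell[fixation_to_patch(x, y, img_w, img_h, grid)] += duration
        # Rank by dwell time, breaking ties by patch index for determinism.
        ranked = sorted(dwell.items(), key=lambda kv: (-kv[1], kv[0]))
        targets.append([patch for patch, _ in ranked[:k]])
    return targets


scanpath = [(10, 10, 0.5), (20, 20, 0.3), (80, 80, 0.4), (90, 10, 0.1)]
targets = scanpath_to_targets(scanpath, 100, 100, grid=2, n_tokens=2, k=1)
# targets == [[0], [3]]: top-left patch first, then bottom-right
```

Because the segments are taken in viewing order, the first gaze token is tied to where the reader looked first, which is the property that distinguishes this supervision from a static heatmap.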

Modeling

Base Model: Qwen2.5-VL-7B-Instruct

Training Method: Two-stage supervised fine-tuning with LoRA

Objective Functions:

Purpose: Align gaze tokens with radiologist fixation patches.

Formally: Cross-entropy over patch IDs, $L_{gaze} = -\sum_{t} \sum_{p \in P_t} \log P(p \mid z_t)$, where $z_t$ is the hidden state of the $t$-th gaze token and $P_t$ is its set of target patches.

Purpose: Multi-label disease classification.

Formally: Binary cross-entropy, $L_{cls}$.

Purpose: Maintain text generation capability.

Formally: Autoregressive language modeling loss, $L_{lm}$.
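The three objectives can be sketched in plain Python as follows. The gaze term sums negative log-probabilities over target patches, matching the formula above. The way the terms are combined is an assumption: the summary reports a gaze loss weight of 0.7 but not the full expression, so `total_loss` uses a simple additive form for illustration only.

```python
import math


def gaze_loss(patch_probs, targets):
    """Sum of -log P(p | z_t) over every target patch of every gaze token.

    patch_probs: one probability list over patches per gaze token.
    targets:     one list of target patch indices per gaze token.
    """
    return sum(-math.log(probs[p])
               for probs, tgt in zip(patch_probs, targets)
               for p in tgt)


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over the finding labels."""
    terms = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
             for p, y in zip(probs, labels)]
    return sum(terms) / len(terms)


def total_loss(l_lm, l_cls, l_gaze, lam=0.7):
    """Assumed combination: L_lm + L_cls + lam * L_gaze (not confirmed)."""
    return l_lm + l_cls + lam * l_gaze


l_gaze = gaze_loss([[0.5, 0.5], [0.25, 0.75]], [[0], [1]])
l_cls = bce_loss([0.9, 0.2], [1, 0])
loss = total_loss(1.0, l_cls, l_gaze)
```

With top-k supervision, each gaze token contributes several terms to `gaze_loss`, one per target patch.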

Adaptation: LoRA (Low-Rank Adaptation)
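For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single weight matrix: the pretrained weight stays frozen and only two small matrices are trained. The rank, scaling factor, and targeted layers used in the paper are not given in this summary, so the values here are purely illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, b, a, alpha, r):
    """Return W + (alpha / r) * B @ A.

    w: frozen pretrained weight, d_out x d_in
    b: trainable matrix, d_out x r
    a: trainable matrix, r x d_in
    alpha, r: scaling factor and rank (illustrative values only)
    """
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]          # d_out x r with r = 1
a = [[0.5, 0.5]]            # r x d_in
w_eff = lora_effective_weight(w, b, a, alpha=2, r=1)
# w_eff == [[2.0, 1.0], [2.0, 3.0]]
```

When `b` is all zeros, which is the usual initialization, the effective weight equals the pretrained weight, so fine-tuning starts from the base model's behavior.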

Key Hyperparameters:

gaze_loss_weight_lambda: 0.7
gaze_tokens_count: 4
gaze_supervision_type: top-k patch indices per token

Compute: 8x NVIDIA RTX A6000 (24GB)

Comparison to Prior Work

vs. Heatmap-Gaze: Preserves temporal order of gaze (scanpath) via sequential token supervision rather than static maps
vs. CoVT: Uses external human gaze data as ground truth for visual tokens rather than self-generated visual states
vs. Vanilla SFT: Explicitly supervises intermediate reasoning steps with human attention data

Limitations

Requires synchronized eye-tracking data, which is scarce and expensive to collect
Fixed number of gaze tokens (4) may not capture complex scanpaths of varying lengths
Implementation details for 'top-k patches' and the gaze-audio alignment heuristics are only briefly described

Reproducibility

No code or model weights are provided. The method uses the public MIMIC-EYE dataset and builds on the Qwen2.5-VL backbone.

📊 Experiments & Results

Evaluation Setup

Chest X-ray multi-label classification (14 diseases)

Benchmarks:

MIMIC-EYE (In-domain classification)
CheXpert 5x200 (Zero-shot out-of-domain classification)
RSNA (Zero-shot out-of-domain classification)
SIIM-ACR (Zero-shot out-of-domain classification)

Metrics:

AUROC
Accuracy
F1-score
Statistical methodology: Not explicitly reported in the paper
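The metrics listed above can be computed as in the following sketch. It uses the pairwise-ranking definition of AUROC and the usual precision/recall form of F1. Whether the paper macro-averages over the 14 labels is not stated in this summary, so `macro_auroc` is an assumption about the aggregation.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(score_columns, label_columns):
    """Unweighted mean of per-label AUROC (assumed aggregation)."""
    values = [auroc(s, y) for s, y in zip(score_columns, label_columns)]
    return sum(values) / len(values)


def f1_score(preds, labels):
    """Binary F1 from hard predictions; returns 0.0 if undefined."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


example_auc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])   # 0.75
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])         # 0.5
```

The pairwise form is quadratic in the number of samples, which is fine for illustration but slower than the rank-based computation used by standard libraries.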

Key Results

In-domain results on MIMIC-EYE show that sequential gaze supervision (Original-Gaze) outperforms baselines and shuffled variants.

Benchmark	Metric	Baseline	This Paper	Δ
MIMIC-EYE	AUROC	87.60	90.17	+2.57
MIMIC-EYE	AUROC	88.51 (shuffled gaze)	90.17	+1.66
MIMIC-EYE	AUROC	86.45 (random gaze)	90.17	+3.72

Zero-shot generalization results on external benchmarks demonstrate robustness improvements.

Benchmark	Metric	Baseline	This Paper	Δ
RSNA	F1-score	48.64	53.73	+5.09
CheXpert 5x200	Accuracy	55.60	62.45	+6.85

Main Takeaways

Sequential gaze supervision consistently outperforms static heatmap or text-only baselines.
Preserving the temporal order of gaze fixations is critical; shuffling the order reduces performance gains, suggesting the 'scanpath' encodes reasoning.
Models trained with gaze supervision generalize better to unseen datasets (zero-shot), likely by learning transferable visual search strategies.
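The ordered-versus-shuffled comparison behind these takeaways can be illustrated with a small sketch that builds the three supervision variants from one set of targets. The exact construction of the shuffled and random baselines is not described in this summary, so the sketch assumes that "shuffled" permutes the temporal order of the per-token targets and that "random" replaces each target with a uniformly sampled patch index. The function name is hypothetical.

```python
import random


def make_gaze_variants(targets, n_patches, seed=0):
    """Build original, shuffled, and random gaze targets for an ablation.

    targets:   per-gaze-token lists of patch indices, in temporal order.
    n_patches: total number of image patches.
    """
    rng = random.Random(seed)

    # Shuffled: same patch sets, temporal order permuted (assumption).
    shuffled = [list(t) for t in targets]
    rng.shuffle(shuffled)

    # Random: same shape, patches drawn uniformly at random (assumption).
    random_targets = [[rng.randrange(n_patches) for _ in t] for t in targets]

    return {
        "original": [list(t) for t in targets],
        "shuffled": shuffled,
        "random": random_targets,
    }


variants = make_gaze_variants([[0], [3], [5], [9]], n_patches=16)
```

Under these assumptions the shuffled variant keeps the spatial content of the gaze but discards its order, so any gap between it and the original variant can be attributed to temporal structure.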

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) and tokenizer architectures
Basic understanding of eye-tracking data (fixations, scanpaths)
Medical imaging (Chest X-ray) diagnosis tasks

Key Terms

VLM: Vision-Language Model—AI that processes both images and text to generate outputs

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for classification tasks

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to them

gaze tokens: Special placeholder tokens added to the model's vocabulary, trained to represent specific visual locations (image patches) attended to by human experts

scanpath: The temporal sequence of eye movements (fixations) across an image

MIMIC-EYE: A dataset linking MIMIC-CXR chest X-rays with eye-tracking data collected from radiologists

heatmap: A spatial map showing distribution of attention or gaze density, typically discarding temporal order

CheXpert/RSNA/SIIM-ACR: Standard benchmark datasets for chest X-ray disease classification