MM Research Area Summary

📖 What are Multi-Modal LLMs?

Multi-Modal LLM research develops AI systems that jointly perceive, reason about, and generate content across vision, language, audio, and embodied action modalities.

💡 Why it Matters

Bridging the gap between human-like multimodal cognition and current AI is essential for trustworthy visual assistants, autonomous systems, creative tools, and scientific discovery.

🎯 Key Paradigms

Vision-Language Understanding

Enabling models to answer questions about images, locate objects from text descriptions, read documents, and generate image captions by integrating visual perception with language reasoning

Multimodal Generation

Synthesizing high-fidelity images, videos, and edits from text descriptions using diffusion models, flow matching, and autoregressive transformers aligned with human preferences via reinforcement learning

Video Understanding

Processing temporal dynamics in video through question answering, temporal grounding, and causal reasoning with memory-augmented architectures and tool-augmented agents

Multimodal Reasoning and Alignment

Enhancing logical inference over visual inputs while reducing hallucinations through chain-of-thought reasoning, process reward models, and preference-based alignment

Architecture and Efficiency

Designing efficient visual encoders, compressing tokens, and optimizing multimodal pretraining to enable deployment on resource-constrained devices without sacrificing accuracy

Embodied AI and Robotics

Unifying perception, language, and physical action in vision-language-action models for robotic manipulation, autonomous driving, and world simulation

📅 Field Evolution Timeline

2023-01 to 2023-12 Foundation Era

Establishing multimodal architectures, early reasoning paradigms, and foundational benchmarks

LLaVA established the visual instruction tuning paradigm connecting CLIP to LLMs (Visual Instruction Tuning, 2023)
Multimodal Chain-of-Thought pioneered two-stage reasoning for VQA, achieving 85.31% on ScienceQA with a sub-1B model
ViperGPT created the visual programming paradigm translating queries to executable Python code
MMMU benchmark revealed a 33-point gap between GPT-4V and human experts on college-level multimodal questions
CM3Leon proved autoregressive models can rival diffusion for image generation with 5x less compute
AnimateDiff introduced plug-and-play motion modules to animate personalized text-to-image models

Shift from task-specific vision models to general-purpose multimodal LLMs via instruction tuning

2024-01 to 2024-12 Scaling and Alignment Era

Large-scale preference alignment, benchmark proliferation, efficient architectures, and zero-shot personalization

InstantID enabled plug-and-play face personalization using face recognition embeddings without fine-tuning
PaLI-X jointly scaled vision (ViT-22B) and language (32B) components to achieve 86.0 on VQAv2
Video-MME established the comprehensive benchmark for video LLM evaluation across all durations
T2V-Turbo-v2 achieved 85.13 VBench score, surpassing commercial systems Gen-3 and Kling
Preference optimization with just 5K samples was shown to reverse language degradation from visual fine-tuning
Flow matching emerged as the dominant action generation paradigm for robotic VLA models (π₀)

Shift from supervised fine-tuning to reward-based preference optimization for multimodal alignment
Emergence of contrastive backbones (SigLIP) as the standard foundation for grounding tasks

2025-01 to 2025-12 RL Revolution Era

GRPO-based reinforcement learning transforms all sub-fields from understanding to generation to robotics

R1-Zero replicated emergent reasoning ('aha moment') in multimodal models via GRPO without supervised fine-tuning
Flow-GRPO boosted SD3.5-M GenEval from 63% to 95%, spawning 30+ GRPO variants for visual generation
Visual-RFT extended R1-style RL to visual tasks, improving COCO detection mAP from 9.8 to 31.3
SimpleVLA-RL achieved 91.7% on LIBERO-Long from a single demonstration via GRPO adaptation
RL-100 achieved 100% success across 1000 real-world robot evaluations and 7-hour continuous operation
olmOCR made large-scale PDF processing economically viable at $176 per million pages via distillation

Group Relative Policy Optimization (GRPO) became the dominant post-training paradigm across understanding, generation, and embodied AI
Chain-of-thought reasoning extended from text-only to visual generation and robotic control

2026-01 to 2026-03 Maturation and Deployment Era

Sub-1B deployment models, agentic self-improvement, spatial intelligence, and automated content production

GLM-OCR ranked first on OmniDocBench v1.5 with a 0.9B model using multi-token prediction
VLA-Thinker introduced perception as a dynamically invocable reasoning action, tripling long-horizon success
World2Mind achieved +17.6% on VSI-Bench via training-free allocentric spatial reasoning
AdaReasoner achieved 97.6% on spatial planning via RL-trained adaptive tool orchestration
RubiCap's 7B model outperformed GPT-4V on dense captioning via rubric-guided RL

Focus expanded from compression to understanding and steering what models learn, using LLMs as semantic teachers and mechanistic interpretability to locate biases

🔧 Vision-Language Understanding

What: Research on models that jointly process visual and textual information to understand images, videos, and multimodal content for reasoning, generation, and decision-making.

Why: Bridging vision and language enables AI systems to perceive, reason about, and act upon the visual world using natural language instructions.

Baseline: Early Vision-Language Models (VLMs) build on encoders like CLIP, which align image-text pairs via contrastive learning, and then decode with a frozen Large Language Model using fixed-resolution visual tokens.

VLMs hallucinate content not present in images, prioritizing language priors over visual evidence
Fixed-resolution processing destroys fine-grained detail and prevents unified image-video understanding
Text-only Chain-of-Thought reasoning cannot actively inspect visual details like small objects or specific video frames

🧪 Running Example

❓ Given a photo of a complex restaurant menu board, read the prices of all dessert items and calculate the total cost of ordering one of each.

Baseline: A standard VLM resizes the high-resolution menu image to 336×336 pixels, destroying fine text. It hallucinates plausible-sounding prices from language priors rather than reading the actual numbers, producing an incorrect total.

Challenge: This example illustrates three key challenges: (1) fixed-resolution processing loses critical text detail; (2) the model hallucinates prices it cannot read; (3) pure text reasoning cannot zoom into specific regions to verify numbers.

✅ Dynamic Resolution VLM Architecture: Qwen2-VL processes the menu at its native resolution using dynamic patch sequences, preserving all text detail without distortion.

✅ Pixel-Space Visual Reasoning: Pixel Reasoner would zoom into the dessert section using visual operations, actively inspecting each price before reasoning about the total.

✅ Hallucination Mitigation via Preference Optimization: HA-DPO trains the model to prefer visually grounded outputs over plausible-sounding fabrications, reducing the chance of inventing prices.
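To make the resolution argument in the running example concrete, the following is a minimal sketch of how much detail a fixed 336×336 resize discards compared with native-resolution patching. The patch size of 14 pixels, the 2×2 token merge, the photo dimensions, and the glyph height are all assumptions chosen for illustration; they are not taken from any specific model's configuration.

```python
import math

PATCH = 14  # assumed ViT patch size in pixels (illustrative)
MERGE = 2   # assumed merge of 2x2 neighbouring patch tokens (illustrative)


def visual_tokens(width: int, height: int, patch: int = PATCH, merge: int = MERGE) -> int:
    """Count visual tokens after patchifying an image and merging neighbours."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return math.ceil(cols / merge) * math.ceil(rows / merge)


def glyph_height_after_resize(glyph_px: float, src_height: int, dst_height: int) -> float:
    """Height in pixels of a text glyph once the image is rescaled vertically."""
    return glyph_px * dst_height / src_height


# Hypothetical menu photo: 2016x1512 pixels with price digits 24 pixels tall.
menu_w, menu_h, digit_px = 2016, 1512, 24

fixed_tokens = visual_tokens(336, 336)          # fixed-resolution baseline
native_tokens = visual_tokens(menu_w, menu_h)   # dynamic, native resolution
shrunk_digit = glyph_height_after_resize(digit_px, menu_h, 336)

print(f"fixed: {fixed_tokens} tokens, native: {native_tokens} tokens")
print(f"a {digit_px}px digit shrinks to about {shrunk_digit:.1f}px after resizing")
```

Under these assumptions the price digits shrink to roughly five pixels, which is too small to read reliably, while native-resolution processing pays for the preserved detail with a much longer token sequence.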

📈 Overall Progress

The field has undergone three major paradigm shifts: from fixed-resolution contrastive models (CLIP era) to dynamic-resolution architectures (Qwen2-VL, NVILA), from supervised fine-tuning to reinforcement learning-based reasoning (ThinkLite-VL, Pixel Reasoner), and from passive perception to active agentic behavior with tool use and self-evolution (OpenThinkIMG, MM-Zero). Joint visual-textual reasoning now rivals or surpasses human-level performance on specific benchmarks, while 7B-parameter models routinely outperform GPT-4o on targeted tasks.

📂 Sub-topics

VLM Architecture & Efficient Training

180 papers

Core model architectures for vision-language understanding, including dynamic resolution handling, efficient token compression, and novel training paradigms like diffusion-based VLMs and data-efficient curation.

Qwen2-VL NVILA Pixtral 12B CADC

Visual Reasoning & Chain-of-Thought

150 papers

Methods that enhance VLMs' reasoning capabilities through reinforcement learning, visual chain-of-thought, spatial reasoning, and mathematical problem solving with active visual inspection.

ThinkLite-VL Pixel Reasoner OpenVLThinker DyME

Embodied AI & Robotics

130 papers

Vision-Language-Action (VLA) models for robotic manipulation, autonomous driving, and embodied navigation, leveraging VLMs for task planning, spatial understanding, and closed-loop control.

VoxPoser SimLingo Poutine BridgeVLA

Evaluation & Benchmarks

120 papers

Benchmarks and evaluation frameworks that systematically assess VLM capabilities, expose failure modes like data leakage and position bias, and measure progress in spatial reasoning, safety, and domain-specific tasks.

MMStar VisionArena DatBench POPE

Safety, Alignment & Hallucination Mitigation

100 papers

Research addressing VLM reliability, including hallucination detection and mitigation, adversarial robustness, reward modeling for human alignment, and safety against jailbreak attacks.

HA-DPO POPE POVID RoVRM
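The preference-based methods listed here train a model to rank a visually grounded response above a hallucinated one. A minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair is shown below. It assumes sequence-level log-probabilities are already available; the numeric values and the `beta` setting are hypothetical, and methods such as HA-DPO may add terms beyond this basic objective.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities under the policy being trained and
    under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss = ln 2.
neutral = dpo_loss(-6.0, -8.0, -6.0, -8.0)

# Policy has raised the grounded answer and lowered the hallucinated one.
improved = dpo_loss(-5.0, -9.0, -6.0, -8.0)

print(f"neutral loss: {neutral:.4f}, improved loss: {improved:.4f}")
```

The loss falls as the policy widens the gap between the grounded and the fabricated response relative to the reference model, which is the behaviour the alignment methods above rely on.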

Video Understanding & Temporal Reasoning

71 papers

Long video comprehension, streaming video reasoning, and temporal understanding using VLMs, addressing challenges of context length, frame selection, and temporal consistency.

LongVILA ReAgent-V Think-as-You-See MovieChat

💡 Key Insights

💡 Reinforcement learning post-training enables 7B models to surpass GPT-4o on visual reasoning benchmarks.

💡 Fixed-resolution encoding destroys critical visual detail; dynamic resolution yields 30%+ accuracy gains on text-heavy tasks.

💡 Most VLM benchmarks contain samples solvable without images, inflating reported capabilities.

💡 Pixel-space reasoning with active visual tools outperforms text-only chain-of-thought on fine-grained perception.

💡 Data curation with 5% of training data can match or exceed full-dataset performance when guided by capability analysis.


📅 Timeline

Research has evolved from building foundational architectures (2023) through rigorous evaluation and efficiency optimization (2024) to RL-driven reasoning breakthroughs and agentic self-improvement (2025-2026), with an accelerating trend toward embodied applications and zero-data self-evolution.

2023-01 to 2023-12 Foundation building: early VLM architectures, hallucination awareness, and first embodied applications

POPE (Evaluating Object Hallucination in Large..., 2023) established the foundational hallucination evaluation benchmark, revealing that LVLMs answer 'yes' to 99% of object queries
VoxPoser (2023) pioneered LLM-synthesized 3D value maps for zero-shot robotic manipulation
MMBench (2023) introduced circular evaluation and LLM-based choice extraction for robust VLM assessment
HA-DPO (Beyond Hallucinations, 2023) reframed hallucination elimination as a preference optimization task
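The 'yes' bias reported for POPE-style evaluation is easy to miss if only recall is inspected. The sketch below computes the usual metrics for a yes/no object-probing benchmark, including the ratio of 'yes' answers. The function name and the toy data are assumptions for illustration, not the benchmark's official tooling.

```python
def pope_metrics(predictions, labels):
    """Accuracy, precision, recall, F1 and yes-ratio for yes/no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    total = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


# Toy balanced probe set: half the queried objects are present in the image.
labels = ["yes", "no"] * 3
always_yes = ["yes"] * len(labels)  # a model that agrees with every query

metrics = pope_metrics(always_yes, labels)
print(metrics)
```

A model that always answers 'yes' reaches perfect recall on a balanced set while accuracy stays at chance, so the yes-ratio is what exposes the bias.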

2024-01 to 2024-12 Architecture breakthroughs and rigorous evaluation: dynamic resolution, efficient VLMs, and exposing benchmark weaknesses

MMStar (Are We on the Right..., 2024) revealed that models like GeminiPro outperform random choice by 24% without accessing any visual input, exposing severe data leakage in benchmarks
Qwen2-VL (2024) introduced Naive Dynamic Resolution with M-RoPE, achieving 93.8% on DocVQA and surpassing GPT-4o on MathVista by 6.7%
NVILA (2024) demonstrated a scale-then-compress architecture, reducing training costs by 1.9-5.1x while matching leading open VLMs
LongVILA (2024) scaled long-context VLMs to handle long videos through distributed training innovations
VisionArena (2024) collected 230K real-world user-VLM conversations with preference labels for human-aligned evaluation

🔀 Shift from fixed-resolution visual encoding to dynamic, native-resolution processing (Qwen2-VL, NVILA), enabling unified image-video understanding.

2025-01 to 2025-12 RL-driven reasoning revolution and agentic VLMs: visual chain-of-thought, embodied intelligence, and data-efficient self-improvement

ThinkLite-VL (SoTA with Less, 2025) achieved 75.1% on MathVista with a 7B model using only 11k samples via MCTS-guided reinforcement fine-tuning, surpassing GPT-4o (63.8%)
Pixel Reasoner (2025) introduced pixel-space reasoning with curiosity-driven RL, outperforming Gemini-2.5-Pro on V* Bench (84.3% vs 79.2%)
CADC (Capability-Attributed Data Curation, 2025) surpassed full-data training using only 5% of data by analyzing intrinsic model capabilities
EditReward (2025) achieved 65.72% accuracy on GenAI-Bench, outperforming GPT-5 (59.61%) for image editing reward modeling
Safety at Scale (2025) provided the first unified safety taxonomy across modalities, analyzing 574 papers

🔀 Reinforcement learning emerged as the dominant post-training paradigm, enabling VLMs to reason in pixel space, use visual tools, and self-improve with minimal data.

2026-01 to 2026-03 Self-evolving systems and diagnostic deep-dives: zero-data evolution, entity tracking, and mechanistic understanding

MM-Zero (2026) demonstrated self-evolving VLMs from zero data using a tri-role Proposer-Coder-Solver framework
SGCoT (Can VLMs Solve the Shell Game?, 2026) exposed that frontier VLMs perform at random chance on entity tracking and introduced Spatiotemporal Grounded Chain-of-Thought achieving >90% accuracy
DatBench (2026) achieved 13x evaluation speedup while revealing 35-point accuracy drops when converting from multiple-choice to generative evaluation
Meissa (2026) matched proprietary frontier agents in 10 of 16 medical settings while being 22x faster and using 25x fewer parameters

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Dynamic Resolution VLM Architectures	Replace fixed-resolution visual encoding with dynamic patch sequences and multimodal positional embeddings (e.g., M-RoPE) that decompose into spatial and temporal dimensions.	Improves on fixed-resolution baselines by +6.7% on MathVista (Qwen2-VL-72B vs GPT-4o) and achieves 93.8% on DocVQA, setting new state-of-the-art for document understanding.	Qwen2-VL (2025), NVILA (2024), Pixtral 12B (2024), What matters when building vision-language... (2024)
Reinforcement Learning for Visual Reasoning	Use reinforcement learning with verifiable rewards to train VLMs to reason step-by-step, combining text-based chain-of-thought with active visual inspection tools.	ThinkLite-VL-7B achieves 75.1% on MathVista, surpassing GPT-4o (63.8%) and Qwen2.5-VL-72B (71.9%) using only 11k training samples.	SoTA with Less (2025), Pixel Reasoner (2025), OpenThinkIMG (2025), DualMindVLM (2025)
Vision-Language-Action Models for Embodied AI	Bridge perception and action by using VLMs as high-level planners that output structured action representations (3D value maps, meta-actions, waypoints) for low-level controllers.	VoxPoser enables zero-shot robotic manipulation from language commands; SimLingo achieves state-of-the-art closed-loop autonomous driving with language-action alignment.	VoxPoser (2023), SimLingo (2025), Poutine (2025), Interactive Post-Training for Vision-Language-Action Models (2025)
Hallucination Mitigation & Human Alignment	Train models to prefer visually grounded outputs over plausible-sounding fabrications using preference pairs, or identify and suppress attention heads that copy incorrect prompt information.	HA-DPO improves MiniGPT-4 POPE accuracy from 51.13% to 86.13% (+35 points); POVID reduces CHAIR hallucination score from 66.8 to 31.8 on LLaVA-1.5.	Evaluating Object Hallucination in Large... (2023), Beyond Hallucinations (2023), Aligning Modalities in Vision Large... (2024), Mechanisms of Prompt-Induced Hallucination in... (2026)
Visual Token Compression & Efficient Inference	Select the most informative visual tokens using object-centric attention, optimal transport, or context-aware resolution prediction, then discard redundant tokens before language model processing.	OC-VTP retains only 11.1% of visual tokens on LLaVA-1.5 while maintaining 95.5% performance, achieving 17x reduction in prefill FLOPs.	OC-VTP (2025), PACT (2025), VLM-Pruner (2025), InfiniteVL (2025)
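The token compression row above boils down to scoring visual tokens and keeping only a small fraction before the language model sees them. The sketch below shows that selection step with a generic top-k rule. The importance scores are synthetic placeholders, and the 576-token grid is an assumed 24×24 layout; the object-centric or optimal-transport scoring used by the cited methods is not reproduced here.

```python
def prune_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in original order.

    scores: one importance value per visual token (higher is more useful).
    keep_ratio: fraction of tokens to retain, e.g. 0.111 for 11.1%.
    """
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sort the survivors back into positional order so spatial layout is kept.
    return sorted(ranked[:keep])


# Tiny example: keep half of four tokens.
small = prune_tokens([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)

# Assumed 24x24 grid of visual tokens with synthetic placeholder scores.
synthetic_scores = [(i * 37) % 101 for i in range(576)]
kept = prune_tokens(synthetic_scores, keep_ratio=0.111)

print(small)
print(f"kept {len(kept)} of {len(synthetic_scores)} tokens")
```

With these assumptions an 11.1% budget leaves 64 of 576 tokens, which is where the prefill savings come from, since every discarded token is skipped by all subsequent language-model layers.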

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MathVista	Accuracy (%)	79.7%	SoTA with Less (2025)
DocVQA	Accuracy (%)	93.8%	Qwen2-VL (2025)
V* Bench	Accuracy (%)	84.3%	Pixel Reasoner (2025)
POPE (Adversarial)	Accuracy (%)	86.13% (HA-DPO on MiniGPT-4)	Beyond Hallucinations (2023)
MMStar	Accuracy (%)	57.1%	Are We on the Right... (2024)

⚠️ Known Limitations (4)

VLMs systematically hallucinate by prioritizing language priors over visual evidence, especially for high-count objects and specific prompt phrasings, undermining reliability in safety-critical applications. (affects: Dynamic Resolution VLM Architectures, Reinforcement Learning for Visual Reasoning)
Potential fix: Identify and ablate specific prompt-copying attention heads (PIH-heads); use preference optimization with synthetic negative examples (HA-DPO, POVID); apply contrastive decoding to suppress outlier token attention (DAMRO).
VLMs lack genuine spatial reasoning and object permanence, performing near random chance on entity tracking tasks when visual shortcuts are removed. (affects: Dynamic Resolution VLM Architectures, Vision-Language-Action Models for Embodied AI)
Potential fix: Spatiotemporal Grounded Chain-of-Thought (SGCoT) forces explicit coordinate tracking; blueprint-based spatial reasoning constructs structured representations before answering; external 3D tools (depth estimation, point clouds) augment visual understanding.
Benchmark evaluations are often unreliable due to data leakage, multiple-choice format inflation, and text-solvable questions, making it difficult to measure true VLM progress. (affects: Dynamic Resolution VLM Architectures, Reinforcement Learning for Visual Reasoning)
Potential fix: Use generative evaluation instead of multiple-choice (DatBench); filter benchmarks for visual necessity (MMStar); employ circular evaluation to detect position bias (MMBench); collect real-world user preference data (VisionArena).
Visual token processing accounts for 95-99% of compute in VLMs, creating a severe efficiency bottleneck for high-resolution and long-video inputs that limits deployment on edge devices. (affects: Dynamic Resolution VLM Architectures, Visual Token Compression & Efficient Inference)
Potential fix: Object-centric token pruning (OC-VTP) retains 11% of tokens at 95.5% performance; context-aware resolution selection (CARES) reduces tokens by 70-80%; hybrid linear-attention architectures (InfiniteVL) achieve constant-memory streaming inference.
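Among the evaluation fixes listed above, circular evaluation is simple enough to sketch directly: each multiple-choice question is asked once per rotation of its options, and it counts as solved only if the model is right every time. The `model` callable below, which returns the index of its chosen option, is a hypothetical interface used for illustration.

```python
def circular_eval(question, choices, answer_idx, model):
    """Return True only if the model is correct under every option rotation."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # The correct option moves left by `shift` positions after rotation.
        correct_pos = (answer_idx - shift) % n
        if model(question, rotated) != correct_pos:
            return False
    return True


question = "Which city is the capital of France?"
choices = ["Paris", "Rome", "Oslo", "Bern"]


def always_first(q, options):
    """Position-biased model: always picks the first option."""
    return 0


def content_aware(q, options):
    """Model that actually finds the right option wherever it appears."""
    return options.index("Paris")


biased_passes = circular_eval(question, choices, 0, always_first)
robust_passes = circular_eval(question, choices, 0, content_aware)
print(biased_passes, robust_passes)
```

The position-biased model would be scored as correct in a single pass because the answer happens to sit first, but it fails as soon as the options rotate.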

📚 Major papers in this topic (10)

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2025-05)
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement (2025-04)
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning (2025-05)
NVILA: Efficient Visual Language Models from Pre-training to Deployment (2024-12)
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (2023-07)
Are We on the Right Way for Evaluating Large Vision-Language Models? (2024-04)
CADC: Capability-Attributed Data Curation (2025-10)
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing (2025-09)
Can Vision-Language Models Solve the Shell Game? (2026-03)
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (2025-05)

💡 Diving deeper into Vision-Language Understanding, let's examine specific research threads that define this area.

🎯 Visual Question Answering

What: Visual Question Answering (VQA) requires models to answer natural language questions about visual inputs by integrating perception, reasoning, and external knowledge.

Why: Enabling machines to understand and reason about visual content is critical for accessibility, autonomous systems, medical diagnosis, and human-AI interaction.

Baseline: Standard Vision-Language Models encode images globally with a frozen vision encoder and generate answers directly via a language model without structured reasoning.

Complex multi-step reasoning requiring compositional logic, spatial understanding, and domain-specific knowledge integration
Hallucination and robustness failures where models generate plausible but incorrect answers based on language priors rather than visual evidence
Scaling to high-resolution, multi-image, and long-video inputs while maintaining computational efficiency on resource-constrained devices

🧪 Running Example

❓ Looking at this restaurant receipt photo, what is the total cost of the items circled in red?

Baseline: A standard VLM processes the receipt at low resolution, missing small text. It may hallucinate a plausible total based on common receipt patterns rather than reading the actual numbers, or fail to identify which items are circled.

Challenge: This example requires OCR (reading small text), spatial reasoning (identifying circled items), and arithmetic (summing prices) — a composition of perception, localization, and multi-step logic that baseline models handle poorly.

✅ Multimodal Chain-of-Thought Reasoning: Decomposes the task into explicit stages: first summarize the receipt layout, then identify circled items via visual attention, then extract prices, and finally compute the sum — catching errors at each step.

✅ Visual Program Execution: Generates executable Python code that calls an OCR tool to extract text, a detection model to find circled regions, and standard arithmetic to compute the total — delegating each sub-task to the best-suited tool.

✅ High-Resolution Efficient Architectures: Processes the receipt at full resolution by splitting it into overlapping patches with dedicated adapters, ensuring small text and fine spatial details are preserved without overwhelming the model.
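The visual program execution approach in the running example can be sketched as a short program that delegates perception to tools and keeps the arithmetic in ordinary code. Everything below is a toy stand-in: the receipt is a list of records rather than an image, and `find_circled_items` and `read_price` are hypothetical placeholders for a detection model and an OCR tool.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    """Toy stand-in for one receipt line that real tools would extract."""
    name: str
    price: float
    circled: bool


def find_circled_items(receipt):
    """Placeholder for a detector that locates items circled in red."""
    return [item for item in receipt if item.circled]


def read_price(item):
    """Placeholder for an OCR call that reads the price of one item."""
    return item.price


def total_of_circled(receipt):
    """The kind of program an LLM might generate for the receipt question."""
    circled = find_circled_items(receipt)
    prices = [read_price(item) for item in circled]
    return round(sum(prices), 2)


receipt = [
    LineItem("Coffee", 3.50, circled=True),
    LineItem("Bagel", 2.25, circled=False),
    LineItem("Cake", 4.75, circled=True),
]

total = total_of_circled(receipt)
print(f"total of circled items: {total}")
```

The design point is that the sum is computed by the interpreter rather than predicted by the language model, so an arithmetic error can only come from a perception tool misreading its input.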

📈 Overall Progress

Visual Question Answering has moved through three major paradigms: end-to-end neural models (2017-2022), compositional program execution and structured reasoning (2023-2024), and most recently reinforcement learning-driven emergent reasoning without human supervision (2025-2026). The field has simultaneously expanded from generic VQA to expert-level domain-specific applications, while benchmark difficulty has escalated dramatically: top models still trail humans by 20-30% on challenging benchmarks like MMMU and MME-RealWorld.

📂 Sub-topics

Chain-of-Thought and Multi-Step Reasoning

30 papers

Methods that decompose visual question answering into structured, multi-step reasoning processes — including Chain-of-Thought prompting, stage-wise generation, and visual grounding of intermediate reasoning steps.

Multimodal Chain-of-Thought Stage-Wise Reasoning Visual Chain-of-Thought Compositional CoT

Reinforcement Learning for Visual Reasoning

22 papers

Approaches using reinforcement learning — particularly Group Relative Policy Optimization (GRPO) and verifiable rewards — to elicit reasoning capabilities in VLMs without requiring human-annotated reasoning traces.

GRPO-based Visual Reasoning Test-Time RL Verifiable Reward Training
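The core of Group Relative Policy Optimization is that each sampled response is scored against the other responses to the same prompt, so no separate value network is needed. A minimal sketch of that advantage computation follows. The reward values are hypothetical, and implementations differ in details such as the choice of standard deviation estimator and the `eps` constant.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one visual question: two correct, two wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, the group gives no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print([round(a, 3) for a in mixed])
print(uniform)
```

Correct answers receive positive advantages and incorrect ones negative, while a group with identical rewards contributes nothing to the update, which is why prompts that are neither trivially easy nor impossible tend to be the most useful for this kind of training.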

Domain-Specific Visual Question Answering

45 papers

Adapting VQA to specialized domains including medical imaging, autonomous driving, remote sensing, and agriculture, where general-purpose models lack domain knowledge and fine-grained perception.

Medical VQA Driving VQA Remote Sensing VQA Domain Adaptation

VQA Benchmarks and Model Evaluation

38 papers

Creation of comprehensive benchmarks for evaluating VLMs across diverse dimensions including expert-level reasoning, cultural understanding, spatial awareness, factuality, and robustness to adversarial inputs.

Expert-Level Benchmarking Cultural Evaluation Factuality Assessment Adversarial Testing

Efficient Architectures and Training Recipes

30 papers

Innovations in VLM architecture design — including high-resolution processing, Mixture-of-Experts, hybrid Mamba-Transformer models, and data-centric training strategies — to improve capability and efficiency.

High-Resolution Processing Mixture-of-Experts Data-Centric Training Small VLMs

Visual Programming and Tool-Augmented VQA

12 papers

Methods that translate visual queries into executable programs or invoke specialized tools, leveraging code LLMs as reasoning engines and pre-trained vision models as perception modules.

Visual Program Execution Tool-Augmented Reasoning Procedural Video Querying

Robustness, Safety, and Hallucination Mitigation

17 papers

Research on understanding and mitigating VLM failures including adversarial vulnerability, visual hallucinations, sycophancy, spurious biases, and developing safety guardrails for multimodal content.

Adversarial Defense Hallucination Mitigation Safety Guardrails Bias Detection

💡 Key Insights

💡 Reinforcement learning elicits emergent visual reasoning without human-annotated traces

💡 Even top models trail human experts by 20-30% on expert-level benchmarks

💡 Small 2-3B models can outperform 70B+ models when trained with RL-based reasoning

💡 Visual programming via code generation enables zero-shot compositional reasoning

💡 Chain-of-Thought prompting sometimes degrades performance on spatial and visual tasks


📅 Timeline

Research has evolved from monolithic predict-the-answer models toward modular, reasoning-aware systems that decompose perception from logic. The dominant trend in 2025-2026 is replacing supervised fine-tuning with reinforcement learning (GRPO) to elicit reasoning, alongside increasing specialization for high-stakes domains like medicine and autonomous driving.

2017-01 to 2023-06 Foundations of Visual Question Answering and early multimodal integration

MemexQA (2017) introduced personal photo collection QA, treating VQA as retrieval-then-inference over dynamic multimodal collections
Multimodal Chain-of-Thought (2023) pioneered two-stage rationale generation for visual reasoning, achieving 85.31% on ScienceQA with a sub-1B model
ViperGPT (Visual Inference via Python Execution, 2023) established the visual programming paradigm, translating queries to executable Python code with pre-trained vision tools
MIMIC-IT (2023) introduced multi-modal in-context examples at scale, creating the Otter model with the highest human-evaluated Elo rating

2023-07 to 2024-06 Benchmark creation era and architectural innovations for high-resolution and compositional reasoning

MMMU (Massive Multi-discipline Multimodal Understanding, 2023) created the definitive expert-level benchmark with 11.5K questions across 30 subjects, where GPT-4V achieves only 55.7% vs human 88.6%
MathVista (Evaluating Mathematical Reasoning, 2023) unified 28 visual-math datasets, revealing GPT-4V trails humans by 10.4 points at 49.9% accuracy
CogAgent (A Visual Language Model for..., 2023) introduced a high-resolution cross-module for GUI understanding, achieving SOTA on AITW with >50% FLOPs reduction
Vision-Flan (Scaling Human-Labeled Tasks, 2024) demonstrated that diverse human-labeled task training followed by minimal GPT-4 alignment achieves superior generalization
SpatialVLM (Endowing VLMs with Spatial Reasoning, 2024) generated 2 billion synthetic 3D spatial VQA pairs, enabling VLMs to estimate metric distances where GPT-4V produces valid numbers only 1% of the time

🔀 Shift from simple captioning-based VQA to expert-level multimodal reasoning benchmarks (MMMU, MathVista) that exposed massive gaps between model and human performance, driving the field toward structured reasoning.

2024-07 to 2025-06 Chain-of-Thought revolution, reinforcement learning for reasoning, and domain specialization

LLaVA-CoT (Let Vision Language Models Reason Step-by-Step, 2024) introduced stage-wise retracing search, surpassing GPT-4o-mini on multimodal reasoning benchmarks
R1-Zero (Visual Reasoning on a 2B..., 2025) first replicated DeepSeek R1's 'aha moment' in a multimodal setting, applying GRPO directly to base VLMs without SFT
MedVLM-R1 (Medical VLM via Reinforcement Learning, 2025) boosted medical VQA accuracy from 55.11% to 78.22% using only 600 samples and GRPO, outperforming 72B models
MME-RealWorld (A Benchmark for MLLM in..., 2025) created a high-resolution human-annotated benchmark where state-of-the-art models fail to surpass 60% accuracy
MM1.5 (Data-Centric, 2024) provided the definitive data-centric training recipe, achieving 91.0 on DocVQA surpassing GPT-4V

🔀 Emergence of RL-based reasoning (GRPO/RLVR) as a replacement for supervised fine-tuning, enabling models to develop emergent reasoning without human-annotated traces — the 'aha moment' paradigm.

2025-07 to 2026-03 Test-time adaptation, agentic reasoning, and holistic multimodal unification

TTRV (Test-Time, 2025) pioneered RL at inference time on unlabeled data, improving ImageNet-Sketch by +52.4% and outperforming GPT-4o on classification
Hulu-Med (Transparent Generalist Medical VLM, 2025) unified text, 2D/3D images, and video understanding in a single medical architecture, surpassing GPT-4o on 16 of 30 benchmarks
Structured Visual Chain-of-Thought (2025) provided the first large-scale expert-annotated medical reasoning dataset with bounding-box grounded reasoning steps
CRAG-MM (Comprehensive RAG Benchmark for Multi-modal Multi-turn, 2025) established realistic wearable VQA evaluation where even GPT-5 achieves only 63% accuracy with 31% hallucinations
Logic-Driven (2026) introduced a Logical Consistency Reward that penalizes reasoning drift, improving reasoning accuracy by +19.65%

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Multimodal Chain-of-Thought Reasoning	Separate rationale generation from answer inference, injecting dense vision features into a two-stage framework that first explains, then concludes.	Improves on direct answer prediction by +3.68% on ScienceQA (85.31% vs 81.63% text-only CoT), and LLaVA-CoT surpasses GPT-4o-mini and Gemini-1.5-Pro on average across 6 benchmarks.	Multimodal Chain-of-Thought Reasoning in Language... (2023), LLaVA-CoT (2024), DDCoT (2023), Compositional Chain-of-Thought Prompting for Large... (2023)
Reinforcement Learning for Visual Reasoning	Bypass supervised fine-tuning entirely, applying RL with simple accuracy and format rewards directly on base VLMs to induce spontaneous multi-step reasoning.	VisualThinker-R1-Zero achieves 59.47% on CVBench, outperforming the SFT-tuned Qwen2-VL-2B by ~2% and the base model by ~30%. TTRV improves ImageNet-Sketch by +52.4% at test time.	R1-Zero's 'Aha Moment' in Visual... (2025), TTRV (2025), Med-R1 (2025), Game-RL (2025)
Visual Program Execution	Replace end-to-end neural inference with LLM-generated code that orchestrates pre-trained vision models via a simple API, enabling zero-shot compositional reasoning.	ViperGPT achieves 72.0% on RefCOCO zero-shot, outperforming GLIP by +17.0%, and surpasses the 80B Flamingo on OK-VQA (51.9%) despite being zero-shot. ProViQ improves ActivityNet-QA by +25% over prior zero-shot methods.	ViperGPT (2023), Visual Program Distillation (2023), Zero-Shot (2023)
Data-Centric Multimodal Instruction Tuning	Maximize task diversity and data quality through systematic curation, using multi-modal in-context examples and two-stage tuning to balance capability and alignment.	Vision-Flan achieves +3.1 on MM-Bench and +6.5 on MME over LLaVA-1.5 while maintaining 84.0% on catastrophic forgetting benchmarks vs 73.3%. MIMIC-IT's Otter model achieves the highest Elo rating (1014.7) on Multi-Modality Arena.	MIMIC-IT (2023), Vision-Flan (2024), MM1.5 (2024)
High-Resolution Efficient VLM Architectures	Decouple high-resolution detail capture from the main model pathway using lightweight branches, token compression, or linear-complexity layers to scale visual context affordably.	CogAgent achieves SOTA on AITW and 9 VQA benchmarks while reducing FLOPs by >50% vs scaling standard models to 1120×1120. LLaVA-Phi (3B) outperforms 7B+ models on ScienceQA. LongLLaVA processes ~1000 images on a single A100.	CogAgent (2023), LLaVA-Phi (2024), LongLLaVA (2024), Kimi-VL (2025)
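The reinforcement learning row in the table above mentions "simple accuracy and format rewards". A minimal sketch of such a rule-based reward is given below. The `<think>` and `<answer>` tag template and the exact-match comparison are assumptions in the style of R1-like training setups, not the reward used by any particular paper listed here.

```python
import re

# Assumed output template: reasoning in <think>, final answer in <answer>.
TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0


def total_reward(completion: str, gold: str) -> float:
    """Sum of the two verifiable rewards; the equal weighting is an assumption."""
    return accuracy_reward(completion, gold) + format_reward(completion)


good = total_reward("<think>3.50 + 4.75 = 8.25</think><answer>8.25</answer>", "8.25")
wrong = total_reward("<think>3.50 + 4.75 = 8.35</think><answer>8.35</answer>", "8.25")
unformatted = total_reward("8.25", "8.25")
print(good, wrong, unformatted)
```

Because both checks are deterministic string tests, the reward needs no human-annotated reasoning traces and no learned reward model, only a reference answer per question.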

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MMMU (Massive Multi-discipline Multimodal Understanding)	Accuracy (%)	64.0%	Kimi-VL (2025)
MathVista (Mathematical Reasoning in Visual Contexts)	Accuracy (%)	80.1%	Kimi-VL (2025)
ScienceQA (Science Question Answering)	Accuracy (%)	85.31%	Multimodal Chain-of-Thought Reasoning in Language... (2023)
OK-VQA (Outside Knowledge Visual Question Answering)	Accuracy (%)	62.0%	Bootstrapping Large Language Models with... (2026)
MME-RealWorld (Real-World Multimodal Evaluation)	Accuracy (%)	<60%	MME-RealWorld (2025)

⚠️ Known Limitations (4)

Hallucination and factual unreliability: VLMs frequently generate plausible but incorrect answers based on language priors rather than visual evidence, especially for rare entities or fine-grained details (affects: Multimodal Chain-of-Thought Reasoning, Data-Centric Multimodal Instruction Tuning)
Potential fix: Bottom-up reasoning with scene graph verification, contrastive self-training (VC-STaR), visual attention amplification, and Image-DPO training to penalize text-prior-based guessing
Domain transfer gap: Models trained on internet-scale data struggle severely in specialized domains (medical, remote sensing, ancient documents) due to missing fine-grained perceptual and knowledge requirements (affects: Data-Centric Multimodal Instruction Tuning, High-Resolution Efficient VLM Architectures)
Potential fix: Domain-specific pre-training with curriculum strategies, specialized vision encoders (MedSigLIP), cross-spectral bridging (GRAFT), and GRPO-based domain adaptation requiring minimal annotated samples
Adversarial vulnerability: VLMs are highly susceptible to adversarial visual perturbations that bypass safety alignment, and Chain-of-Thought reasoning provides only marginal robustness improvements (affects: Multimodal Chain-of-Thought Reasoning, High-Resolution Efficient VLM Architectures)
Potential fix: Adversarial pre-training at web scale (Δ-CLIP), double visual defense combining adversarial pre-training with adversarial instruction tuning, and ECSO's training-free image-to-text transformation for safety restoration
Cultural and linguistic bias: Models are predominantly Western/English-centric, with massive performance gaps on non-Western cultural concepts and low-resource languages (up to 30+ percentage point drops) (affects: Data-Centric Multimodal Instruction Tuning, Reinforcement Learning for Visual Reasoning (GRPO/RLVR))
Potential fix: Native-language dataset construction (AMCrawl for Arabic), culturally sourced benchmarks (CulturalVQA, K-Viscuit), and scalable multilingual chart generation via code decoupling

📚 View major papers in this topic (10)

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023-11) 9
ViperGPT: Visual Inference via Python Execution for Reasoning (2023-03) 9
CogAgent: A Visual Language Model for GUI Agents (2023-12) 9
MIMIC-IT: Multi-Modal In-Context Instruction Tuning (2023-06) 9
R1-Zero's 'Aha Moment' in Visual Reasoning on a 2B Non-SFT Model (2025-03) 9
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (2023-10) 9
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding (2025-10) 9
TTRV: Test-Time Reinforcement Learning for Vision Language Models (2025-10) 9
Double Visual Defense: A Novel Adversarial Defense for Vision-Language Models (2025-02) 9
S-Chain: Structured Visual Chain-of-Thought for Medicine (2025-10) 9

💡 Within the same paradigm, another important research direction focuses on Visual Grounding and Object Detection.

🔄

Visual Grounding and Object Detection

What: Visual grounding connects natural language descriptions to specific regions, objects, or temporal segments in images, videos, and 3D scenes, enabling precise spatial localization.

Why: Reliable grounding is essential for embodied agents, GUI automation, medical diagnosis, and autonomous driving where imprecise localization leads to catastrophic failures.

Baseline: Standard approaches use independent visual and text encoders with cross-modal fusion decoders to regress bounding box coordinates from image-text pairs.

Models often hallucinate objects or ignore visual evidence, relying on language priors instead of genuinely grounding predictions in image content
Precise coordinate generation is brittle for language-centric architectures, especially with small objects, high-resolution screens, and cluttered scenes
Extending grounding from 2D images to 3D scenes, temporal video segments, and interactive environments requires multi-step spatial reasoning

🧪 Running Example

❓ Find the small red mug behind the laptop on the cluttered desk in this office photo.

Baseline: A standard VLM processes the entire image globally and predicts coordinates as text tokens (e.g., '[0.45, 0.62, 0.52, 0.70]'). It often selects the most salient mug in the scene rather than the one specifically behind the laptop, because the independent encoders lack text-conditioned visual attention.

Challenge: This example requires: (1) resolving spatial relations ('behind the laptop'), (2) distinguishing among multiple similar objects ('small red mug' vs. other mugs), and (3) attending to a small region in a cluttered scene where language priors may override visual evidence.

✅ Visual Reinforcement Fine-Tuning: Trains the model with IoU-based verifiable rewards, so it learns from geometric feedback that its box must overlap the correct mug — not just produce plausible coordinates.

✅ Grounded Visual Chain-of-Thought: The model first outputs 'I see a laptop at [0.2, 0.3, 0.6, 0.7]' and then reasons 'behind it means the region at [0.45, 0.6...]', explicitly grounding each reasoning step in visual coordinates before the final answer.

✅ Coordinate-Free Attention Grounding: Instead of generating numeric coordinates, the model uses an attention head that directly maps to the visual patch containing the red mug, bypassing the brittle text-to-coordinate generation entirely.
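
The IoU-based verifiable reward mentioned for Visual Reinforcement Fine-Tuning can be sketched as follows. This is a minimal illustration, assuming boxes are emitted as `[x1, y1, x2, y2]` text in normalized coordinates; the function names and the `format_bonus` value are assumptions, not taken from any specific paper.

```python
import re

NUM = r"(\d*\.?\d+)"
BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([NUM] * 4) + r"\s*\]")


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(model_output, target_box, format_bonus=0.1):
    """Reward = IoU with the target plus a small bonus for a parseable box.

    Returns 0.0 when no well-formed box can be found in the output.
    """
    match = BOX_PATTERN.search(model_output)
    if match is None:
        return 0.0
    pred = tuple(float(v) for v in match.groups())
    return iou(pred, target_box) + format_bonus
```

A box on the wrong mug earns only the format bonus, so plausible-looking coordinates are not enough to score well.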

📈 Overall Progress

The field has undergone a paradigm shift from static supervised coordinate regression to dynamic RL-based policy optimization, where models actively search images and ground reasoning in visual evidence. Early work established contrastive pretraining and high-resolution architectures as foundations, while recent advances demonstrate that small RL-trained models (3-7B) can outperform much larger supervised models (72B+) on precision grounding tasks. The convergence of GUI grounding, 3D spatial reasoning, and medical/remote sensing applications shows grounding becoming a universal capability rather than a niche task.

📂 Sub-topics

Reinforcement Learning for Visual Perception & Grounding

45 papers

Applies reinforcement learning with verifiable rewards (RLVR) — such as IoU scores and format checks — to train VLMs for precise visual localization, replacing supervised fine-tuning with policy optimization that directly optimizes geometric metrics.

Visual-RFT, Perception-R1, GRPO-based Visual RL, VLM-R1
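
The policy-optimization step can be sketched through the group-relative advantage used in GRPO-style training: several responses are sampled for the same prompt, each is scored (for example by IoU), and rewards are standardized within the group. This is a minimal sketch; the use of the population standard deviation and the `eps` value are assumptions, and implementations differ on such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: IoU rewards for four sampled boxes on one image-query pair.
advantages = group_relative_advantages([0.10, 0.50, 0.90, 0.50])
```

Responses that beat the group average get a positive advantage, so no separate learned value model is needed to provide a baseline.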

Grounded Visual Reasoning & Chain-of-Thought

40 papers

Methods that interleave textual reasoning steps with explicit visual references (bounding boxes, cropped regions, visual tokens) to anchor multi-step reasoning in spatial evidence rather than relying on text-only chains.

Visual CoT, Grounded Reasoning (GRIT), Region-Conditioned RL (VLM-R3), ViGoRL
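
One way to picture an interleaved trace is as a list of (thought, box) pairs that can be checked mechanically. The sketch below assumes one reasoning step per line and boxes written as `[x1, y1, x2, y2]`; the `GroundedStep` type and both helper names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class GroundedStep(NamedTuple):
    thought: str
    box: Optional[Tuple[float, ...]]  # None when the step cites no region


_BOX = re.compile(r"\[([^\]]+)\]")


def parse_steps(trace: str) -> List[GroundedStep]:
    """Split a reasoning trace into steps and extract each step's box, if any."""
    steps = []
    for line in trace.strip().splitlines():
        box = None
        match = _BOX.search(line)
        if match:
            try:
                values = tuple(float(v) for v in match.group(1).split(","))
            except ValueError:
                values = ()
            if len(values) == 4:
                box = values
        steps.append(GroundedStep(line.strip(), box))
    return steps


def fully_grounded(steps: List[GroundedStep]) -> bool:
    """True only if there is at least one step and every step cites a region."""
    return bool(steps) and all(s.box is not None for s in steps)
```

A check like `fully_grounded` could serve as a format reward or as a filter on training traces.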

GUI & Screen Agent Grounding

25 papers

Focuses on precisely localizing UI elements (buttons, text fields, icons) in high-resolution screenshots for GUI automation, addressing challenges like visual clutter, tiny targets, and the mismatch between dense pixel coordinates and language tokens.

CogAgent, GUI-Actor, SE-RFT, UI-AGILE
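
The coordinate-free idea can be sketched as picking the highest-scoring visual patch and clicking its center, rather than decoding digits. This illustrates the general principle only, not GUI-Actor's actual mechanism; the score grid is assumed to come from some attention or patch-scoring head.

```python
def click_point_from_patch_scores(scores, image_w, image_h):
    """Map a [rows][cols] grid of patch relevance scores to a click point.

    Returns the pixel center (x, y) of the highest-scoring patch.
    """
    rows, cols = len(scores), len(scores[0])
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: scores[rc[0]][rc[1]],
    )
    patch_w, patch_h = image_w / cols, image_h / rows
    return ((best_c + 0.5) * patch_w, (best_r + 0.5) * patch_h)


# Example: a 2x2 patch grid over a 200x100 screenshot.
point = click_point_from_patch_scores([[0.1, 0.2], [0.9, 0.3]], 200, 100)
```

Localization precision is then bounded by patch size instead of by the model's ability to emit exact numeric tokens.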

3D Scene Understanding & Visual Grounding

25 papers

Extends visual grounding from 2D images to 3D scenes, leveraging multi-view reasoning, point clouds, and Bird's Eye View representations to localize objects in physical environments for embodied AI.

VLM-Grounder, GPT4Scene, Agent3D-Zero, EmbodiedScan

Open-Vocabulary Detection & Segmentation

35 papers

Leverages vision-language pretraining (CLIP, SigLIP) to detect and segment arbitrary objects described in natural language, enabling zero-shot generalization beyond fixed training categories.

Grounding-DINO, PaLI-3/PaLI-X, FLAIR, PSALM

Temporal Video Grounding

15 papers

Localizes specific temporal segments in videos given natural language queries, combining visual understanding with temporal reasoning using RL-based optimization of timestamp predictions.

Time-R1, TAR-TVG, VideoRFT
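
For timestamp predictions, the natural verifiable reward is the one-dimensional analogue of box IoU. A minimal sketch, assuming segments are `(start, end)` pairs in seconds; the function name is illustrative.

```python
def temporal_iou(pred, target):
    """IoU of two (start, end) time segments."""
    (ps, pe), (ts, te) = pred, target
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction of 10 to 20 seconds against a target of 15 to 25 seconds overlaps for 5 seconds out of a 15-second union.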

Domain-Specific Grounding (Medical, Remote Sensing, Robotics)

40 papers

Adapts visual grounding to specialized domains requiring domain knowledge, including medical image grounding for radiology, satellite imagery analysis, and robotic manipulation with physical reasoning.

MedGround-R1, GeoChat, GRAFT, PG-InstructBLIP

Hallucination Mitigation & Visual Grounding Reliability

24 papers

Addresses the fundamental reliability challenge where VLMs generate plausible but visually ungrounded outputs, developing detection methods, evaluation benchmarks, and mitigation techniques to ensure outputs are anchored in visual evidence.

M3ID, PROVE, DASH, Visual Evidence Augmentation

💡 Key Insights

💡 RL with geometric rewards enables small 3-7B models to outperform 72B supervised models on grounding

💡 Grounded chain-of-thought anchoring reasoning in bounding boxes dramatically reduces visual hallucination

💡 Zero-shot 3D grounding via multi-view 2D reasoning rivals supervised methods trained on 3D data

💡 Coordinate-free attention grounding eliminates brittle text-to-number generation for GUI agents

💡 Contrastive backbones like SigLIP are now the standard foundation for all grounding tasks

📅 Timeline

Research has evolved from improving vision-language alignment quality (2023-2024) through data-centric training recipes (2024-2025) to RL-driven active visual reasoning (2025-2026), with increasing emphasis on grounded chain-of-thought that anchors every reasoning step in spatial evidence.

2023-09 to 2024-06 Foundation architectures and large-scale datasets for grounded vision-language models

PaLI-3 (PaLI-3, 2023) demonstrated that SigLIP contrastive backbones massively outperform classification-pretrained ViTs for localization, achieving +23.7% on RefCOCOgs
CogAgent (CogAgent, 2023) pioneered dual-resolution visual processing for GUI understanding, enabling high-resolution screen grounding with >50% fewer FLOPs
PaLI-X (On Scaling up a Multilingual..., 2024) jointly scaled vision and language to achieve 86.0 on VQAv2 with integrated OCR pretraining
3D-GRAND (3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM Dataset, 2024) introduced million-scale densely grounded 3D data, outperforming prior 3D-LLMs by +7.7% on ScanRefer

🔀 Shift from classification-pretrained vision backbones to contrastively pretrained encoders (SigLIP/CLIP) as the standard for grounding tasks.

2024-07 to 2025-03 Data-centric training recipes, efficient grounding architectures, and alignment optimization

FLAIR (VLM with Fine-grained Language-informed Image Representations, 2024) introduced text-conditioned attention pooling, outperforming CLIP by +14.4% mIoU on zero-shot segmentation with 100x less data
SimVG (A Simple Framework for Visual..., 2024) achieved 94.46% accuracy on RefCOCO testA via dynamic weight-balance distillation, training in 12 hours on a single GPU
Visual-RFT (Visual Reinforcement Fine-Tuning, 2025) launched the RL-for-vision paradigm by extending DeepSeek-R1-style training to visual tasks, improving mAP from 9.8 to 31.3 on COCO
MM1.5 (Methods, Analysis & Insights from..., 2024) established the data-centric recipe for balancing OCR, grounding, and general capabilities across training stages

2025-04 to 2026-03 Reinforcement learning revolution for visual grounding with grounded chain-of-thought reasoning

Perception-R1 (Perception-R1, 2025) became the first pure MLLM to surpass 30% mAP on COCO using bipartite matching rewards
ViGoRL (Grounded Reinforcement Learning for Visual Reasoning, 2025) combined MCTS-guided training with active visual search, achieving 86.4% on V*Bench and outperforming proprietary models
GUI-Actor (Coordinate-Free Visual Grounding for GUI Agents, 2025) eliminated coordinate generation entirely using attention-based grounding, with a 7B model outperforming the 72B UI-TARS
Molmo2 (Open Weights and Data for..., 2026) provided the first fully open video grounding pipeline with tracking and pointing, outperforming Gemini 2.5 Pro on ReasonVOS

🔀 Fundamental shift from supervised coordinate regression to RL-based policy optimization with verifiable visual rewards, enabling models to actively search and reason over images.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Visual Reinforcement Fine-Tuning	Replace supervised imitation with Group Relative Policy Optimization (GRPO) using geometric rewards (IoU, mAP) to directly optimize visual perception and grounding.	Improves on SFT baselines by +21.5 mAP on COCO open-vocabulary detection (Visual-RFT), achieving 31.3 mAP. Perception-R1 surpasses 30% mAP threshold on COCO, the first pure MLLM to do so.	Visual-RFT (2025), Perception-R1 (2025), VLM-R1 (2025), Perception-Aware (2025)
Grounded Visual Chain-of-Thought Reasoning	Redefine each reasoning step as a tuple of text thought plus spatial coordinate, forcing the model to 'point and look' at evidence while thinking.	ViGoRL improves on vanilla GRPO by +12.9% accuracy on the SAT-2 spatial reasoning benchmark and achieves 86.4% on V*Bench. Argus achieves 62.7 on MMVP, surpassing Gemini 1.5 Pro (61.3).	Grounded Reinforcement Learning for Visual... (2025), Argus (2025), GRIT (2025), VoCoT (2024)
Coordinate-Free & Attention-Based GUI Grounding	Use attention heads or patch-level scoring to directly map instructions to visual regions, bypassing the text-to-coordinate generation bottleneck.	GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro, outperforming the much larger UI-TARS-72B (38.1). SE-RFT-7B achieves 47.3% on ScreenSpot-Pro, a relative gain of 24.2% over UI-TARS-72B.	CogAgent (2023), GUI-Actor (2025), Enhancing Visual Grounding for GUI... (2025), UI-AGILE (2025)
Zero-Shot 3D Visual Grounding via Multi-View VLM Reasoning	Reconceptualize 3D understanding as iterative 2D viewpoint selection and multi-view ensemble projection, bypassing scarce 3D-language datasets.	VLM-Grounder achieves 51.6% Acc@0.25 on ScanRefer, outperforming ZS3DVG by +15.2 points. SeqVLM achieves 55.6% Acc@0.25 on ScanRefer, surpassing previous zero-shot SOTA by +4.0%.	VLM-Grounder (2024), GPT4Scene (2025), Agent3D-Zero (2024), 3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM... (2024)
Fine-Grained Vision-Language Alignment for Open-Vocabulary Detection	Condition image representations on specific text queries via attention pooling or pixel-level contrastive learning to capture local visual details.	PaLI-3 (5B) surpasses PaLI-X (55B) on 8 text understanding tasks. FLAIR outperforms CLIP by +14.4% mIoU on zero-shot segmentation despite using 100x less data. MM-Grounding-DINO improves original Grounding-DINO by +12.6 AP on LVIS.	PaLI-3 (2023), FLAIR (2024), An Open and Comprehensive Pipeline... (2024), On Scaling up a Multilingual... (2024)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
RefCOCO (testA)	Accuracy (IoU@0.5)	94.46%	SimVG (2024)
ScreenSpot-Pro	Accuracy (%)	47.3%	Enhancing Visual Grounding for GUI... (2025)
COCO Object Detection (mAP)	mAP (mean Average Precision)	31.9% mAP	Perception-R1 (2025)
ScanRefer (Zero-Shot, Acc@0.25)	Accuracy@0.25 IoU	55.6%	SeqVLM (2025)
V*Bench	Accuracy (%)	86.4%	Grounded Reinforcement Learning for Visual... (2025)
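
The thresholded metrics in this table (IoU@0.5, Acc@0.25) count a prediction as correct when its IoU with the ground truth reaches the threshold. The sketch below shows the 3D case, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names are illustrative and benchmark-specific details are not covered.

```python
def box_volume(box):
    """Volume of an axis-aligned (xmin, ymin, zmin, xmax, ymax, zmax) box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # disjoint along this axis
        inter *= overlap
    return inter / (box_volume(a) + box_volume(b) - inter)


def accuracy_at_iou(preds, targets, threshold=0.25):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou_3d(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)
```

The looser 0.25 threshold is common in 3D settings, where box extents are harder to estimate than in 2D.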

⚠️ Known Limitations (4)

Scale-driven bias in RL training causes models to ignore small but critical objects, as large visual regions dominate reward signals during optimization. (affects: Visual Reinforcement Fine-Tuning (Visual-RFT / GRPO for Vision), Grounded Visual Chain-of-Thought Reasoning)
Potential fix: Scale Relative Policy Optimization (SRPO) normalizes rewards within size bins so small regions compete fairly, as demonstrated by Ground-R1 with +11.9% improvement on V* benchmark.
Extended reasoning chains degrade visual grounding — longer thinking causes models to drift from image evidence and amplify hallucinations, a phenomenon termed 'more thinking, less seeing'. (affects: Grounded Visual Chain-of-Thought Reasoning, Visual Reinforcement Fine-Tuning (Visual-RFT / GRPO for Vision))
Potential fix: PEARL introduces a Fidelity Gate that halts reasoning policy updates when perception checks fail, and PeRL-VL decouples perception training from reasoning to prevent visual signal degradation.
GUI and high-resolution grounding methods are highly sensitive to image noise and visual perturbations, with Visual CoT methods showing higher fragility than standard VLMs in 70 out of 96 corrupted settings. (affects: Coordinate-Free & Attention-Based GUI Grounding, Grounded Visual Chain-of-Thought Reasoning)
Potential fix: Injecting high-confidence detection cues from external object detectors (like Grounding DINO) stabilizes intermediate visual steps and mitigates fragility of internal localization.
3D visual grounding still relies heavily on multi-view rendering or point clouds, creating computational bottlenecks and losing fine-grained details during 2D-to-3D projection. (affects: Zero-Shot 3D Visual Grounding via Multi-View VLM Reasoning)
Potential fix: Explicit 3D representations as reasoning interfaces (SpatialReasoner) that predict calibrated 3D vectors as intermediate steps, and video-based approaches (3D-RFT) that bypass point cloud processing.
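
The size-binned normalization described for the scale-driven bias above can be sketched as follows. This is an illustrative reading of "normalizing rewards within size bins", not the published SRPO formulation; the bin edges, the `eps` value, and the function name are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def size_binned_advantages(samples, bin_edges=(0.01, 0.1), eps=1e-6):
    """Standardize rewards separately within each target-size bin.

    samples: list of (region_area_fraction, reward) pairs.
    """
    def bin_of(area):
        # Index of the size bin: number of edges the area meets or exceeds.
        return sum(area >= edge for edge in bin_edges)

    groups = defaultdict(list)
    for idx, (area, reward) in enumerate(samples):
        groups[bin_of(area)].append((idx, reward))

    advantages = [0.0] * len(samples)
    for members in groups.values():
        rewards = [r for _, r in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        for idx, r in members:
            advantages[idx] = (r - mu) / (sigma + eps)
    return advantages


# Two small-region samples and two large-region samples.
adv = size_binned_advantages([(0.005, 0.2), (0.005, 0.4), (0.5, 0.8), (0.5, 0.9)])
```

Here the better small-object sample gets a positive advantage even though its raw reward (0.4) is below both large-object rewards, which is the fairness property the fix is after.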

📚 View major papers in this topic (10)

Grounded Reinforcement Learning for Visual Reasoning (2025-05) 9
CogAgent: A Visual Language Model for GUI Agents (2023-12) 9
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding (2026-01) 9
On Scaling up a Multilingual Vision and Language Model (PaLI-X) (2024-07) 9
Visual-RFT: Visual Reinforcement Fine-Tuning (2025-03) 8
Perception-R1: Pioneering Perception Policy with Reinforcement Learning (2025-04) 8
3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM Dataset (2024-06) 9
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2025-06) 8
FLAIR: VLM with Fine-grained Language-informed Image Representations (2024-12) 8
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations (2024-06) 9

💡 Within the same paradigm, another important research direction focuses on Image Captioning.

🔍

Image Captioning

What: Image captioning generates natural language descriptions of visual content, bridging perception and language understanding in vision-language models.

Why: High-quality captions enable downstream tasks like visual reasoning, text-to-image generation, and accessible image understanding for diverse applications.

Baseline: Standard supervised fine-tuning trains models on human-annotated image-caption pairs using cross-entropy loss over ground-truth sequences.

Generating detailed, factually accurate descriptions without hallucinating objects or attributes not present in the image
Evaluating caption quality reliably when traditional metrics poorly correlate with human judgment for long, detailed descriptions
Scaling captioning across diverse visual domains including charts, documents, remote sensing, and specialized imagery

🧪 Running Example

❓ Generate a detailed caption for an image showing a busy farmer's market with various produce stalls, a musician playing guitar, and a child holding a red balloon.

Baseline: A standard SFT model might produce 'A busy outdoor market with people shopping' — generic and missing key details like the musician, specific produce types, and the child with the balloon, or it may hallucinate objects not present such as a dog near the stalls.

Challenge: This example illustrates three key challenges: (1) the model must describe many fine-grained details without hallucinating (e.g., inventing a dog not in the scene), (2) it must maintain narrative coherence across a complex scene with multiple focal areas, and (3) traditional metrics like BLEU cannot distinguish between a vague correct caption and a richly detailed accurate one.

✅ Reinforcement Learning for Dense Captioning: CapRL uses a blind LLM to answer visual questions from the caption alone — if the caption omits the musician or misidentifies the produce, the questions fail, providing a concrete reward signal to improve detail coverage without memorizing ground-truth phrasing.

✅ Inference-Time Search and Hierarchical Refinement: TDSR first generates a high-level blueprint ('outdoor market scene with three focal areas'), then progressively fills in details for each area using MCTS search, ensuring the musician, produce stalls, and child are all captured without losing global coherence.

✅ Hallucination-Aware Evaluation Frameworks: CapArena's pairwise human evaluation and ALOHa's open-vocabulary hallucination metric can reliably detect if the caption incorrectly mentions items not in the scene, unlike BLEU or CIDEr which may reward fluent but inaccurate descriptions.

✅ Knowledge-Augmented and Personalized Captioning: MsRAG detects the specific guitar brand or produce varieties via object-level retrieval from external databases, enriching the caption with factual entity-level knowledge beyond generic descriptions.
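
The caption-only question-answering reward described for CapRL can be sketched as the fraction of questions a text-only answerer gets right from the caption alone. The `keyword_answerer` below is a toy stand-in for a blind LLM, and all names are hypothetical.

```python
def caption_qa_reward(caption, qa_pairs, answer_fn):
    """Fraction of (question, answer) pairs answered correctly from the caption alone."""
    correct = sum(
        answer_fn(caption, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)


def keyword_answerer(caption, question):
    """Toy stand-in for a text-only LLM: 'yes' if the queried word is in the caption."""
    keyword = question.rstrip("?").split()[-1].lower()
    return "yes" if keyword in caption.lower() else "no"


QA = [
    ("Is there a musician?", "yes"),
    ("Is there a balloon?", "yes"),
    ("Is there a dog?", "no"),
]
generic = "A busy outdoor market with people shopping"
detailed = (
    "A farmer's market with produce stalls, a musician playing guitar, "
    "and a child holding a red balloon"
)
generic_reward = caption_qa_reward(generic, QA, keyword_answerer)
detailed_reward = caption_qa_reward(detailed, QA, keyword_answerer)
```

The generic caption fails the questions about the musician and the balloon, while a caption that hallucinated a dog would fail the third, so the reward penalizes both omission and invention.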

📈 Overall Progress

Image captioning has undergone two major paradigm shifts: first from traditional n-gram metrics to human-aligned evaluation frameworks based on atomic fact decomposition and pairwise comparison, and second from supervised fine-tuning to reinforcement learning with verifiable rewards. Modern systems achieve human-competitive detailed captioning (GPT-4o surpasses human baselines on CapArena) while simultaneously reducing hallucinations through inference-time search and multi-agent verification. The field has also expanded from natural-image-only captioning to unified multi-domain systems covering documents, charts, 3D scenes, and multilingual content.

📂 Sub-topics

Dense and Detailed Image Captioning

8 papers

Methods for generating comprehensive, fine-grained image descriptions that capture objects, attributes, spatial relations, and contextual details beyond simple one-sentence captions. This sub-topic focuses on training paradigms (especially RL) and inference strategies that improve caption richness and accuracy.

CapRL, VisVM, TDSR, RubiCap

Captioning Evaluation and Benchmarks

5 papers

New metrics, benchmarks, and evaluation frameworks designed to accurately measure the quality, factuality, and comprehensiveness of detailed image captions generated by modern VLMs, moving beyond legacy n-gram-based metrics.

CapArena, ALOHa, CapMAS, CompreCap

Domain-Specific and Multimodal Captioning

10 papers

Captioning systems tailored for specialized visual domains including remote sensing imagery, scientific figures, geometric diagrams, text-rich documents, manga narratives, and 3D scenes, addressing unique challenges each domain presents.

RS-MoE, LaRA, Monkey, LL3DA

Knowledge-Augmented and Personalized Captioning

6 papers

Approaches that integrate external knowledge sources, retrieval-augmented generation, or user-specific concept databases to generate more informative and personalized captions that go beyond generic visual descriptions.

MsRAG, RAP, MyVLM, Taxonomic RAG

Captioning for Downstream Applications

4 papers

Using image captioning as an intermediary text representation to enable tasks like video anomaly detection, safety filtering, physics question answering, and synthetic training data generation.

LAVAD, ECSO, MM-Gen

Robustness, Bias, and Training Paradigms

7 papers

Research addressing VLM robustness to adversarial attacks, social bias propagation, missing modalities, and novel training strategies including in-context learning, encoder-free architectures, and cross-tokenizer prompt optimization.

RLCF, FUSE, EVEv2.0, Scalable Diffusion

💡 Key Insights

💡 RL with verifiable rewards outperforms supervised fine-tuning for dense captioning quality and diversity

💡 Atomic fact decomposition enables reliable hallucination detection where traditional metrics fail

💡 Inference-time search with value networks achieves 74% human preference over greedy decoding

💡 Converting images to rich text enables text-only LLMs to match multimodal visual reasoning performance

💡 Single-image retrieval-based personalization matches or exceeds multi-image fine-tuning approaches

📅 Timeline

Research has evolved from improving resolution handling and basic caption quality (2023) through evaluation innovation and personalization (2024) to RL-driven dense captioning and unified multi-domain systems (2025-2026), with increasing emphasis on factual accuracy over fluency and on converting visual perception into text to enable purely linguistic reasoning.

2023-05 to 2023-11 Foundation models and high-resolution captioning

RLCF (Test-Time, 2023) pioneered using CLIP similarity as a reward signal for test-time reinforcement learning adaptation of VLMs, outperforming TPT by 5.4% on average
Monkey (Image Resolution and Text Label..., 2023) introduced sliding-window patch processing enabling high-resolution captioning up to 1344×896 with per-patch LoRA adapters
LL3DA (Visual Interactive Instruction Tuning for Omni-3D, 2023) enabled 3D scene captioning by integrating visual prompts (clicks, bounding boxes) with point cloud processing via a multi-modal transformer

2024-01 to 2024-12 Evaluation revolution, personalization, and hallucination awareness

CompreCap (Comprehensive Image Captioning Benchmark, 2024) introduced directed scene graph evaluation for detailed captions with hierarchical object-attribute-relation matching
ALOHa (A New Measure for Hallucination, 2024) replaced fixed-vocabulary CHAIR with open-vocabulary LLM-based hallucination detection, improving +30.8% on out-of-domain objects
RAP (Retrieval-Augmented, 2024) introduced a remember-retrieve-generate paradigm for personalized captioning with single-image concept learning, achieving 84.1 CIDEr
VisVM (Vision Value Model, 2024) demonstrated inference-time search with learned value networks, achieving 74% human preference over greedy decoding and +10.8% average improvement across 9 benchmarks
CapMAS (Caption Factuality Multi-Agent System, 2024) introduced decomposition-verification-revision for hallucination correction, identifying that MLLMs rely more on language priors as captions grow longer

🔀 The field shifted from optimizing traditional metrics (BLEU, CIDEr) to developing sophisticated evaluation frameworks based on atomic fact decomposition, directed scene graphs, and human preference modeling.

2025-01 to 2025-12 RL revolution, unified multi-domain captioning, and scalable evaluation

CapArena (Benchmarking Detailed Image Captioning, 2025) established Elo-based model ranking for detailed captioning, with its automated variant CapArena-Auto reaching 94.3% correlation with human rankings, and showed that GPT-4o surpasses human baselines
FeedQuill (Painting with Words, 2025) reduced hallucinations by 40.5% using atomic decomposition-based RL rewards with a new DCScore metric achieving 0.90 Spearman correlation with VLM Arena
OmniCaptioner (One Captioner to Rule Them All, 2025) unified captioning across natural images, visual text, and structured visuals with a 21M dataset, enabling text-only LLMs to achieve state-of-the-art visual reasoning
CapRL (Dense Image Caption via RL, 2025) achieved caption quality comparable to Qwen2.5-VL-72B using a 7B model with perception-reasoning decoupled rewards
TDSR (Top-Down, 2025) reframed captioning as hierarchical planning with MCTS and a lightweight value network, reducing expensive VLM calls by an order of magnitude

🔀 Reinforcement learning with verifiable rewards emerged as the dominant training paradigm for dense captioning, replacing supervised fine-tuning and achieving quality comparable to models 10x larger.

2026-01 to 2026-03 Rubric-guided RL and multilingual expansion

RubiCap (Rubric-Guided, 2026) replaced scalar rewards with committee-generated rubrics of binary checkable rules, enabling a 7B model to outperform GPT-4V and Qwen2.5-VL-72B in blind ranking
MUNIChus (Multilingual News Image Captioning, 2026) introduced the first multilingual news captioning benchmark covering 9 languages with 700K+ images, showing fine-tuned models more than double prompting-based performance

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Reinforcement Learning for Dense Captioning	Caption quality is measured by whether a text-only LLM can answer visual questions using only the generated caption, or via rubric-based binary checks, providing scalable reward signals.	CapRL improves on DenseFusion-1M by +6.8% accuracy on InfoVQA and +3.6% on ChartVQA; RubiCap-7B achieves +20.8% win-rate over base model on PixMoCap, outperforming GPT-4V and Qwen2.5-VL-72B in blind ranking.	Test-Time (2023), Painting with Words (2025), CapRL (2025), RubiCap (2026)
Inference-Time Search and Hierarchical Refinement	A trained value network predicts long-term caption quality using visual grounding signals, guiding tree search to explore multiple descriptive paths before committing to final output.	VisVM-guided captions are preferred 74% over greedy decoding in human evaluation, with +10.8% average improvement across 9 benchmarks for LLaVA-Next-7B; TDSR reduces VLM calls by an order of magnitude versus standard search.	Scaling Inference-Time Search with Vision... (2024), Top-Down (2025)
Unified Multi-Domain Captioning	Converting diverse visual inputs into rich textual descriptions via unified pipelines enables text-only LLMs to achieve visual reasoning without visual encoder training.	OmniCaptioner + DeepSeek-R1 achieves 40.5% on MathVerse, outperforming Qwen2-VL-7B (31.9%); Monkey improves +9.77% over Qwen-VL on document VQA; LaRA achieves +202 points on OCRBench over LLaVAR.	Monkey (2023), TRINS (2024), One Captioner to Rule Them... (2025), Enhancing Large Vision-Language Models with... (2025)
Knowledge-Augmented and Personalized Captioning	Retrieving entity-specific information from external databases and grounding it to detected visual regions produces contextually rich captions that go beyond pattern-matching descriptions.	RAP achieves 84.1 CIDEr on personalized captioning, outperforming MyVLM (76.8) and Yo'LLaVA (73.5); MsRAG outperforms standard mRAG by +21.9% CIDEr using GPT-4o on knowledge-intensive captioning.	MyVLM (2024), RAP (2024), MsRAG (2025)
Hallucination-Aware Evaluation Frameworks	Decomposing captions into atomic facts or objects enables fine-grained per-claim verification against the image, replacing holistic similarity scores that mask hallucinations.	CapArena-Auto achieves 94.3% correlation with human rankings, far surpassing traditional METEOR; ALOHa improves hallucination detection by +30.8% over CHAIR on out-of-domain objects (nocaps-FOIL).	CompreCap (2024), ALOHa (2024), Multimodal large language models excel... (2024), CapArena (2025)
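
Per-object verification can be sketched with a set-based, exact-match hallucination rate in the spirit of fixed-vocabulary metrics such as CHAIR. This is a simplification: open-vocabulary metrics like ALOHa use LLM-based extraction and semantic matching, which this sketch does not attempt. The function name and inputs are assumptions.

```python
def hallucination_rate(mentioned_objects, image_objects):
    """Share of distinct mentioned objects that are absent from the image."""
    mentioned = {o.lower() for o in mentioned_objects}
    present = {o.lower() for o in image_objects}
    if not mentioned:
        return 0.0
    return len(mentioned - present) / len(mentioned)


# Objects extracted from a caption vs. objects annotated in the image.
rate = hallucination_rate(
    ["musician", "guitar", "dog", "balloon"],
    ["musician", "guitar", "balloon", "stall"],
)
```

Unlike a holistic similarity score, this pinpoints which claim is unsupported (the dog), at the cost of needing a reliable object extractor and annotation set.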

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
CapArena (Detailed Captioning Elo Rating)	Elo Rating (higher is better)	~1195 Elo	CapArena (2025)
PixMoCap (Dense Captioning Win-Rate)	Win-Rate improvement (% preferred over baseline)	+20.8% win-rate improvement over base model	RubiCap (2026)
mmHal-V (Hallucination Benchmark)	Relative Hallucination Reduction (%)	40.5% relative hallucination reduction	Painting with Words (2025)
Personalized Image Captioning CIDEr	CIDEr (higher is better)	84.1 CIDEr	RAP (2024)
nocaps-FOIL (Out-of-Domain Hallucination Detection)	Improvement over CHAIR metric (%)	+30.8% improvement over CHAIR	ALOHa (2024)
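
Elo ratings such as those in the CapArena row are derived from pairwise comparisons. The sketch below shows the standard Elo update; the K-factor of 32 and the function names are assumptions, and CapArena's exact rating procedure is not described here.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A's caption was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


new_a, new_b = elo_update(1000, 1000, score_a=1.0)
```

Because the update depends on the gap between expected and actual outcome, beating a much stronger captioner moves ratings more than beating a weaker one.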

⚠️ Known Limitations (4)

Hallucination in detailed captions: As models generate longer, more detailed descriptions, they increasingly rely on language priors rather than visual input, causing factual errors that compound over sequence length. (affects: Reinforcement Learning for Dense Captioning, Unified Multi-Domain Captioning)
Potential fix: Multi-agent verification (CapMAS), atomic fact decomposition with per-fact visual grounding (FeedQuill), inference-time search guided by visual similarity (VisVM), and rubric-based RL training (RubiCap).
Evaluation metric limitations: Traditional metrics like BLEU, CIDEr, and METEOR correlate poorly with human judgment for detailed captions, while newer LLM-based metrics are expensive and may introduce their own biases. (affects: Hallucination-Aware Evaluation Frameworks)
Potential fix: CapArena-Auto uses GPT-4o with reference captions to achieve 94.3% correlation with human rankings at lower cost; DCScore decomposes evaluation into verifiable atomic facts; CompreCap uses directed scene graphs for structured evaluation.
Domain gap across visual types: Models trained on natural images perform poorly on charts, documents, scientific figures, remote sensing imagery, and manga due to fundamentally different visual-linguistic patterns in each domain. (affects: Unified Multi-Domain Captioning, Knowledge-Augmented and Personalized Captioning)
Potential fix: Domain-specific mixture of experts (RS-MoE achieving 13B-level performance with 1B parameters), massive multi-domain training datasets (OmniCaptioner's 21M samples), and OCR-augmented architectures (LaRA) that explicitly feed text content to the LLM.
Vulnerability to adversarial perturbations and bias propagation: VLMs can be manipulated by imperceptible frequency-domain perturbations and exhibit systematic social biases that propagate from embeddings to downstream captioning and retrieval outputs. (affects: Unified Multi-Domain Captioning, Inference-Time Search and Hierarchical Refinement)
Potential fix: Frequency-domain robustness training, bias-aware calibration methods, and multi-model ensemble verification; larger models exhibit stronger bias propagation (Spearman ρ=0.88 for CLIP-L-14 vs 0.80 for CLIP-B-32), suggesting model scaling alone will not resolve this.


💡 Within the same paradigm, another important research direction focuses on Document and Chart Understanding.

📋

Document and Chart Understanding

What: Research on enabling vision-language models to accurately read, parse, and reason over documents, charts, tables, and other visually structured content.

Why: Trillions of pages of knowledge are locked in PDFs, charts, and scanned documents that current AI systems cannot reliably extract or reason about.

Baseline: Traditional pipelines chain separate OCR, layout analysis, and text-based language models, losing visual context and propagating errors across stages.

Balancing fine-grained text recognition with global layout understanding across high-resolution, multi-page documents
Performing multi-step numerical and logical reasoning over chart data requiring precise visual grounding
Scaling to production deployment with compact models while handling diverse languages, scripts, and real-world distortions

🧪 Running Example

❓ According to the Q3 2024 earnings report, how did the company's cloud revenue compare to the analyst forecast shown in the bar chart on page 23?

Baseline: A traditional OCR pipeline extracts text from all 50 pages but misses the bar chart's projected values entirely, retrieves irrelevant pages about other revenue segments, and cannot cross-reference the textual revenue figure with the visual forecast.

Challenge: This example requires three capabilities current systems lack: (1) retrieving the correct page from a long document using visual cues, (2) extracting precise numerical values from a bar chart, and (3) performing comparative reasoning across text and chart modalities.

✅ Visual Document RAG: M3DocRAG encodes each page as a visual embedding and retrieves page 23 directly by matching the query to the chart image, preserving the bar chart's visual information that OCR would lose.

✅ Reinforcement Learning for Document Intelligence: olmOCR 2 uses unit-test rewards to train the model to correctly extract the exact revenue figure and chart values, optimizing for functional correctness rather than fuzzy text matching.

✅ Chart Reasoning via Grounding and Transfer: VisDoT decomposes the query into perception sub-questions (locate the Q3 bar, read its value) and logic sub-questions (compare to the text figure), grounding each step in specific visual regions of the chart.

✅ Compact End-to-End Document Parsers: GLM-OCR processes the page with layout-aware detection followed by a compact 0.9B VLM, accurately parsing both the text table and the chart into structured Markdown at production speed.
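
The page-retrieval step in the first approach can be sketched with ColBERT-style late-interaction scoring, which ColPali-like retrievers are generally described as using. This is an illustrative sketch only: the function names are hypothetical, and the toy two-dimensional vectors stand in for real query-token and page-patch embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best
    cosine similarity against any page patch vector, then sum."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in page_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


def retrieve_top_page(query_vecs, pages):
    """pages maps a page number to its list of patch vectors."""
    return max(pages, key=lambda num: maxsim_score(query_vecs, pages[num]))


# Toy data: page 23 contains patches that align with the query tokens.
pages = {
    22: [[1.0, 0.0], [1.0, 0.1]],
    23: [[0.0, 1.0], [0.7, 0.7]],
}
query = [[0.0, 1.0], [1.0, 1.0]]
print(retrieve_top_page(query, pages))
```

Because scoring operates on image patches rather than extracted text, a page whose relevant content is a chart can still match the query.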

📈 Overall Progress

Document and chart understanding has undergone two major paradigm shifts: first, from text-based to visual-centric retrieval (2024), and second, from supervised fine-tuning to reinforcement learning with verifiable rewards (2025). The field has also demonstrated that compact sub-1B models with specialized architectures can match or exceed general-purpose models 100x their size, suggesting that task-specific design outweighs brute-force scaling for structured document tasks.

📂 Sub-topics

End-to-End OCR & Document Parsing

15 papers

Research on unified models that convert document images directly into structured text (Markdown, HTML, JSON) without brittle multi-stage pipelines, increasingly using reinforcement learning for optimization.

Reinforcement Learning for Document Intelligence, Compact End-to-End Document Parsers

Chart Reasoning & Understanding

10 papers

Methods for extracting data from and performing complex multi-step reasoning over charts, graphs, and flowcharts, including numerical comprehension and cross-subchart inference.

Chart Reasoning via Grounding and Transfer, Reinforcement Learning for Document Intelligence

Multi-Page Document Understanding & RAG

11 papers

Systems that retrieve and reason across multiple pages or documents using visual embeddings, dynamic retrieval strategies, and multi-agent architectures to answer complex questions.

Visual Document RAG, Compact End-to-End Document Parsers

Document-Centric VLM Architectures

7 papers

Large-scale vision-language models specifically designed or adapted for document understanding through joint scaling of vision encoders and language decoders with document-specific training objectives.

Scaled Multi-Task VLMs for Documents

Benchmarks & Evaluation

14 papers

New evaluation frameworks and datasets that expose limitations of current models on real-world documents, including multilingual charts, ancient scripts, enterprise content, physical distortions, and agentic document navigation.

💡 Key Insights

💡 Reinforcement learning with verifiable rewards outperforms supervised fine-tuning for document OCR and chart reasoning.

💡 Sub-1B parameter models match 100x-larger VLMs on document parsing with specialized architecture design.

💡 Visual-centric retrieval outperforms text-based retrieval by 20%+ on layout-rich documents.

💡 Training on few complex reasoning examples transfers better than thousands of simple extraction tasks.

💡 State-of-the-art models still score below 60% accuracy on real-world document benchmarks.


📅 Timeline

Research has evolved from scaling general VLMs with document data (2024) to specialized compact architectures trained with reinforcement learning (2025-2026), while benchmarks have shifted from academic multiple-choice formats to agentic, real-world evaluations testing multi-hop reasoning across modalities.

2024-02 to 2024-08 Foundation VLMs establish document understanding baselines through joint scaling and large-scale datasets

SPHINX-X (SPHINX-X, 2024) simplified MLLM training into a single-stage paradigm with learnable skip tokens, covering 1.1B to MoE scales
ChartPaLI-5B (Chart-based Reasoning, 2024) pioneered transferring reasoning from LLMs to a 5B VLM, outperforming GPT-4V on ChartQA
PaLI-X (On Scaling up a Multilingual..., 2024) jointly scaled vision (ViT-22B) and language (32B) components to new SOTA on VQAv2 (86.0) and TextVQA (84.5)
Idefics3 (Building and better understanding vision-language models, 2024) released Docmatix, a 240x larger open document dataset, with +13.7 point DocVQA improvement

2024-09 to 2025-02 Visual-centric document RAG emerges alongside the first large-scale OCR distillation systems

M3DocRAG (M3DocRAG, 2024) introduced visual-centric RAG, encoding pages as images and reducing retrieval latency from 20s to under 2s per query
SV-RAG (SV-RAG, 2024) reused the MLLM's own hidden states for visual retrieval, eliminating the need for separate encoders
olmOCR (olmOCR: Unlocking Trillions of Tokens..., 2025) enabled PDF processing at $176 per million pages via document-anchored distillation, 35x cheaper than GPT-4o
MME-RealWorld (MME-RealWorld, 2025) exposed massive gaps between academic and real-world performance, with GPT-4o failing to reach 60% accuracy

🔀 Shift from text-based to visual-centric document retrieval: treating pages as images for retrieval rather than extracting text first, preserving charts and layout information.

2025-03 to 2025-12 Reinforcement learning transforms OCR and chart reasoning; specialized benchmarks expose multilingual and domain-specific gaps

Chart-R1 (Chart-R1, 2025) combined code-based data synthesis with CoT-RL, surpassing GPT-4o on ChartQA with 83.9%
olmOCR 2 (olmOCR 2: Unit Test Rewards..., 2025) introduced unit-test-based RL rewards, achieving +14.2 point OCR improvement over the initial release
TRivia (TRivia, 2025) demonstrated self-supervised table recognition that surpasses Gemini 2.5 Pro and GPT-5 without labeled data
Chain-of-Evidence (Chain-of-Evidence, 2025) introduced RL-based evidence grounding with bounding box attribution, improving localization IoU by 47.0%

🔀 Reinforcement learning with verifiable rewards (RLVR) replaces supervised fine-tuning as the dominant training paradigm for document tasks, enabling optimization for functional correctness without expensive labels.
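
A verifiable reward of this kind can be sketched as the pass rate of small programmatic checks run against the model's parsed output. The sketch below is an assumption-laden illustration, not olmOCR 2's actual test suite: the check helpers, the sample output, and the revenue figure are all made up for the example.

```python
def text_present(snippet):
    """Check that a snippet appears in the parsed output."""
    return lambda output: snippet in output


def text_absent(snippet):
    """Check that a snippet (e.g. a page footer) was stripped."""
    return lambda output: snippet not in output


def appears_before(first, second):
    """Check reading order: `first` must occur before `second`."""
    def check(output):
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def unit_test_reward(output, checks):
    """Reward = fraction of checks that pass (0.0 if no checks)."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(output)) / len(checks)


# Hypothetical parsed page and checks; values are invented for illustration.
parsed = "Q3 2024 Results\nCloud revenue: $4.2B\nPage 23 of 50"
checks = [
    text_present("Cloud revenue: $4.2B"),
    text_absent("Page 23 of 50"),          # footer should have been removed
    appears_before("Q3 2024", "Cloud revenue"),
]
print(unit_test_reward(parsed, checks))
```

Each check is binary and cheap to evaluate, so the reward needs no human label and measures functional correctness rather than string similarity to a reference.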

2026-01 to 2026-03 Sub-1B compact parsers achieve SOTA; comprehensive benchmarks test agentic reasoning and cross-modal multi-hop understanding

GLM-OCR (GLM-OCR, 2026) ranked first on OmniDocBench v1.5 with a 0.9B model using multi-token prediction for a 50% throughput gain
PaddleOCR-VL-1.5 (PaddleOCR-VL-1.5, 2026) achieved 94.5% accuracy with mask-based segmentation for warped documents, outperforming 235B-parameter VLMs
MADQA (Strategic Navigation or Stochastic Search?, 2026) introduced agentic document QA, revealing that humans achieve 50% accuracy on first query while Gemini 3 Pro starts at ~12%
VisDoT (VisDoT, 2026) formalized graphical perception theory for chart grounding, achieving +33.2% on VisDoTQA and surpassing GPT-4o on ChartQAPro

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Reinforcement Learning for Document Intelligence	Uses verifiable, rule-based rewards (unit tests, numerical accuracy) instead of human labels to train document understanding models via RL.	Improves on supervised fine-tuning (SFT) by +14.2 points on olmOCR-Bench (olmOCR 2) and +16.7% relative on MultiChartQA (Chart-RL); TRivia surpasses Gemini 2.5 Pro on CC-OCR benchmark, achieving 84.15 vs 79.46 TEDS.	olmOCR 2: Unit Test Rewards... (2025), Chart-RL (2026), TRivia (2025), LightOnOCR (2026)
Visual Document Retrieval-Augmented Generation	Encodes document pages as visual embeddings using models like ColPali, enabling retrieval that preserves charts, tables, and layout information lost by OCR.	Improves on text-based RAG baselines by +22.5% Recall@1 on page retrieval (MMDocIR); SimpleDoc achieves 60.58% on MMLongBench, outperforming M3DocRAG (41.8%) and MDocAgent (55.3%) with +10.4% on LongDocURL.	M3DocRAG (2024), Chain-of-Evidence (2025), MURE (2026), SimpleDoc (2025)
Compact End-to-End Document Parsers	Separates layout analysis (detection) from content recognition (VLM decoding) in sub-1B models with multi-token prediction for speed.	GLM-OCR achieves 94.6 on OmniDocBench v1.5, ranking first among all models, and scores 93.7 on Nanonets-KIE, outperforming GPT-5.2 (87.5); DocVLM improves DocVQA by +30.6 points (56.0% to 86.6%) under a strict 256 visual token limit.	GLM-OCR (2026), PaddleOCR-VL-1.5 (2026), olmOCR: Unlocking Trillions of Tokens... (2025), DocVLM (2024)
Chart Reasoning via Grounding and Transfer	Synthesizes reasoning traces from LLMs and decomposes chart questions into perceptual grounding and logical inference sub-tasks.	Chart-R1 achieves 83.9% on ChartQA, surpassing GPT-4o (80.3%) and Claude-3.5-Sonnet (82.1%); VisDoT improves ChartQA by +11.2% via human-like interpretation grounding and surpasses GPT-4o on ChartQAPro.	Chart-based Reasoning (2024), Chart-R1 (2025), VisDoT (2026), ReFocus (2025)
Scaled Multi-Task VLMs for Documents	Simultaneously scales both vision and language components with multi-stage training recipes including document-specific objectives like text spotting.	PaLI-X achieves 86.0 on VQAv2, surpassing the previous 84.3 SOTA, and 84.5 on TextVQA (+4.6 over prior best 79.9); Idefics3-8B improves DocVQA by +13.7 points over Idefics2-8B using the 240x larger Docmatix dataset.	On Scaling up a Multilingual... (2024), PaliGemma 2 (2024), Building and better understanding vision-language... (2024), SPHINX-X (2024)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
ChartQA	Accuracy (%)	83.9%	Chart-R1 (2025)
OmniDocBench v1.5	Overall Score	94.6	GLM-OCR (2026)
DocVQA	Accuracy (%)	86.6%	DocVLM (2024)
CC-OCR (Table Recognition)	TEDS (Tree Edit Distance Similarity)	84.15 TEDS	TRivia (2025)
olmOCR-Bench	Overall Score (unit test pass rate)	+14.2 points over olmOCR v1	olmOCR 2: Unit Test Rewards... (2025)

⚠️ Known Limitations (4)

Severe multilingual performance degradation: models trained primarily on English data show massive accuracy drops on low-resource languages (e.g., Hindi, Bengali, Odia), limiting global applicability of document understanding systems. (affects: Scaled Multi-Task VLMs for Documents, Chart Reasoning via Grounding and Transfer)
Potential fix: Scalable multilingual data generation pipelines (like PolyChartQA's code decoupling approach) and localized curriculum learning (like VARCO-VISION's bilingual training) show promise for reducing language gaps.
Physical distortion fragility: most document parsers are optimized for clean, digital-born documents and fail significantly on scanned, warped, skewed, or poorly lit real-world documents encountered in production settings. (affects: Compact End-to-End Document Parsers, Visual Document Retrieval-Augmented Generation)
Potential fix: PaddleOCR-VL-1.5's mask-based instance segmentation and SAVIOR's targeted fine-tuning on failure-inducing patterns address specific distortion types, but a general-purpose solution remains elusive.
Benchmark-reality gap: current benchmarks use multiple-choice formats and synthetic data that fail to capture the complexity of enterprise deployment, where free-form generation, noisy inputs, and domain-specific schemas are the norm. (affects: Reinforcement Learning for Document Intelligence, Chart Reasoning via Grounding and Transfer)
Potential fix: Frameworks like ViLD (enterprise-focused evaluation) and MADQA (agentic document QA with accuracy-effort trade-off metrics) are beginning to bridge this gap by evaluating operational capabilities.
Cross-modal reasoning weakness: models show strong text modality bias and struggle with questions requiring integration of evidence across text, tables, and charts within the same document, especially for comparative and tabular reasoning. (affects: Visual Document Retrieval-Augmented Generation, Scaled Multi-Task VLMs for Documents)
Potential fix: Chain-of-Evidence's RL-based stepwise attribution and VisDoT's decomposition-of-thought approach show that explicitly grounding reasoning steps in visual regions can mitigate cross-modal failures.

📚 View major papers in this topic (10)

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models (2025-02) 9
Logics-Parsing-Omni Technical Report (2026-03) 9
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections (2026-03) 9
MME-RealWorld: A Benchmark for MLLM in the Real World (2025-02) 9
On Scaling up a Multilingual Vision and Language Model (2024-07) 9
TRivia: Train Your Own Proprietary Model with Unlabeled Data (2025-12) 9
olmOCR 2: Unit Test Rewards for Document OCR (2025-10) 8
Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner (2025-07) 8
GLM-OCR Technical Report (2026-03) 8
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding (2024-11) 8

💡 Moving to the next paradigm, we turn to Multimodal Reasoning.

🕸️

Multimodal Reasoning

What: Research on enabling models to perform complex multi-step reasoning across visual, auditory, and textual modalities, integrating perception with logical inference.

Why: Real-world tasks—from solving math problems with diagrams to navigating GUIs—require jointly understanding multiple modalities and reasoning over them coherently.

Baseline: Standard multimodal models encode image and text inputs through separate encoders, then generate answers in a single forward pass without iterative verification.

Visual perception errors propagate through reasoning chains, causing cascading failures in downstream steps
Verbose Chain-of-Thought reasoning increases latency and compute cost without proportional accuracy gains
Binary reward signals in reinforcement learning provide no gradient for near-correct predictions
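
The third challenge can be illustrated by comparing a binary hit-or-miss grounding reward with a smooth Gaussian one centered on the target box. This is a rough sketch of the general idea rather than the reward used by any specific paper. The `scale` parameter, which ties the Gaussian width to the box size, is an assumption made for the example.

```python
import math


def binary_reward(click, box):
    """1.0 if the click lands inside the box, else 0.0."""
    x, y = click
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def gaussian_reward(click, box, scale=0.5):
    """Smooth reward that decays with distance from the box center.

    The standard deviation along each axis is `scale` times the box
    width/height (an illustrative choice); the box must have positive size.
    """
    x, y = click
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx, sy = scale * (x2 - x1), scale * (y2 - y1)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


button = (100, 100, 200, 140)
for click in [(150, 120), (205, 120), (400, 120)]:  # hit, near miss, far miss
    print(click, binary_reward(click, button), round(gaussian_reward(click, button), 4))
```

The near miss and the far miss both receive 0.0 under the binary reward, so the learner cannot tell them apart. The Gaussian reward ranks the near miss higher and therefore still carries a learning signal.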

🧪 Running Example

❓ Given a photo of a geometry diagram showing triangle ABC with angle A = 40° and a bisector from B to side AC, find angle x at the intersection.

Baseline: A standard MLLM reads the image in one pass, misidentifies which angle is 40° or overlooks the bisector line, then chains incorrect geometric relationships to produce a wrong answer with no opportunity for self-correction.

Challenge: This example illustrates three key challenges: (1) accurate visual perception to extract angle labels and line segments from the diagram, (2) multi-step geometric reasoning requiring correct prerequisite knowledge, and (3) the need for step-level verification to catch perception errors before they corrupt the entire reasoning chain.

✅ Process Reward Supervision: Evaluates each reasoning step (e.g., 'angle A = 40°' → 'bisector creates two equal angles') and flags the first incorrect step, preventing error propagation through the deduction chain.

✅ Agentic Tool-Augmented Reasoning: Uses visual tools to zoom into the diagram and extract precise angle labels, then invokes a code interpreter to verify geometric calculations before committing to an answer.

✅ Efficient Chain-of-Thought Compression: Encodes intermediate geometric reasoning steps into compact latent tokens, reducing verbose output by ~94% while preserving the essential angle-relationship deduction chain.
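
The step-level check described under Process Reward Supervision can be sketched as follows, assuming a process reward model has already assigned each reasoning step a correctness score in [0, 1]. The scores, the threshold, and the function names here are hypothetical. Using the minimum step score as the chain score is one common aggregation choice, not necessarily the one used by the papers in this section.

```python
def first_flawed_step(step_scores, threshold=0.5):
    """Return the index of the first step scored below the threshold,
    or None if every step passes."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def chain_score(step_scores):
    """Score a whole chain by its weakest step (empty chains score 0.0)."""
    return min(step_scores) if step_scores else 0.0


steps = [
    "Angle A = 40 degrees (read from diagram)",
    "The bisector from B splits angle B into two equal angles",
    "Therefore x = 40 degrees",  # flawed jump, scored low below
]
scores = [0.93, 0.88, 0.21]  # made-up PRM outputs for illustration

bad = first_flawed_step(scores)
print(bad, steps[bad] if bad is not None else "all steps pass")
```

Flagging the first low-scoring step lets a system stop, resample, or re-inspect the image at that point instead of discovering the error only from a wrong final answer.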

📈 Overall Progress

Multimodal reasoning has evolved from evaluating basic visual understanding to actively orchestrating multi-step reasoning with tool use and self-verification. Key paradigm shifts include the transition from binary to process-level rewards, the compression of verbose reasoning into latent representations, and the emergence of agentic frameworks that decouple perception from reasoning. With structured training, 8B-parameter models can now match or exceed models 10x their size.

📂 Sub-topics

Multimodal Mathematical Reasoning

7 papers

Methods and benchmarks for solving mathematical problems that require understanding visual diagrams, charts, or figures alongside textual problem statements, often involving multi-step logical deduction with process-level supervision.

Process Reward Supervision, Mixed Preference Optimization, Multi-Agentic Context Engineering

Efficient and Structured Chain-of-Thought

5 papers

Techniques for compressing, pruning, or restructuring Chain-of-Thought reasoning to reduce computational overhead while preserving or improving accuracy, including latent-space reasoning and iterative test-time scaling.

Efficient Chain-of-Thought Compression, Unified Multimodal Test-Time Scaling

GUI and Agentic Multimodal Reasoning

4 papers

Research on autonomous agents that interact with graphical interfaces or actively invoke external tools—code execution, web search, visual manipulation—during multimodal reasoning loops.

Agentic Tool-Augmented Reasoning, Shaped Reward Reinforcement Learning

Domain-Specific Multimodal Reasoning

7 papers

Application of multimodal reasoning to specialized domains including image quality assessment, audio understanding, time series forecasting, architectural design, knowledge graphs, and video comprehension.

Gaussian Modality Noise Masking, Irregularity-Aware Multimodal Fusion, Room-Instance Tokenization

Multimodal Evaluation and Safety

4 papers

Benchmarks for assessing multimodal reasoning capabilities across diverse dimensions and research on adversarial robustness, including knowledge poisoning attacks and misinformation detection in multimodal settings.

Process Evaluation via LMM-as-a-Judge, Multimodal Poisoning Attacks

💡 Key Insights

💡 Visual perception errors, not reasoning failures, cause most multimodal math mistakes

💡 Process reward models catch flawed reasoning even when final answers appear correct

💡 Latent reasoning tokens compress Chain-of-Thought to 6% of its original length while maintaining comparable accuracy

💡 Agentic tool use enables small models to outperform models ten times their size

💡 Test-time compute scaling transfers effectively from text-only to multimodal domains


📅 Timeline

Research has progressed from static benchmark evaluation (2023–2024) through RL-based reward shaping and process supervision (2024–2025) to agentic, tool-augmented reasoning systems with test-time scaling (2025–2026), with increasing emphasis on compute-efficient inference and multi-agent collaboration.

2023-10 to 2024-07 Foundation benchmarks and early multimodal reasoning evaluation

MM-BigBench (MM-BigBench, 2023) established the first benchmark for multimodal content comprehension where text and image carry equal semantic weight, going beyond visual-only tasks
MM-MATH (MM-MATH, 2024) introduced process evaluation via LMM-as-a-Judge, revealing that diagram misinterpretation causes over 50% of errors in leading models like GPT-4o
GAMA (GAMA, 2024) extended multimodal reasoning to audio by integrating multi-layer feature aggregation with soft semantic prompting, outperforming baselines by 1–84%
We-Math (We-Math, 2024) pioneered knowledge-based hierarchical decomposition, exposing that many LMMs exhibit high rote memorization rates while failing prerequisite sub-problems

2024-11 to 2025-03 Reinforcement learning optimization and process supervision for multimodal reasoning

MPO (Enhancing the Reasoning Ability of..., 2024) combined DPO, BCO, and SFT into a unified preference optimization framework, achieving 8B-model performance comparable to 76B models on MathVista
URSA (Unlocking Multimodal Mathematical Reasoning via..., 2025) introduced PS-GRPO with process reward drop-moments and constructed two large-scale datasets (MMathCoT-1M, DualMath-1.1M), outperforming GPT-4o across 6 benchmarks
Heima (Efficient Reasoning with Hidden Thinking, 2025) demonstrated that reasoning chains can be compressed to 6% of their original length by encoding steps into latent 'thinking tokens' with progressive training
Q-Insight (Q-Insight, 2025) adapted GRPO to visual quality tasks, jointly optimizing score regression and degradation perception to achieve 92.77% classification accuracy

🔀 Transition from binary outcome rewards to structured process-level supervision, enabling models to learn from intermediate reasoning quality rather than just final answer correctness.

2025-05 to 2025-11 Efficient reasoning, agentic architectures, and tool-augmented multimodal models

MM-PRM (MM-PRM, 2025) scaled process reward models via MCTS-based automated labeling over 700K step annotations, achieving +10.10% on out-of-distribution OlympiadBench
GUI-Critic-R1 (Look Before You Leap, 2025) introduced pre-operative action critique with S-GRPO, preventing dangerous GUI automation errors before execution with 91.0 Exact Match score
GUI-G2 (GUI-G2, 2025) replaced binary grounding rewards with Gaussian spatial distributions, enabling a 7B model to surpass UI-TARS-72B by 24.7 points
Simple o3 (Simple o3, 2025) reproduced the 'thinking with images' paradigm with observe-reason-act loops integrating dynamic visual tools, surpassing GPT-4o by 27 points on MME reasoning
DeepEyesV2 (DeepEyesV2, 2025) unified code execution and web search in a single agentic reasoning loop via cold-start SFT followed by outcome-driven RL

🔀 Shift from passive single-pass inference to active agentic reasoning where models invoke tools, critique their own actions, and manipulate visual inputs iteratively.

2026-02 to 2026-03 Unified frameworks, multi-agentic systems, and domain-specific extensions

UniT (UniT, 2026) demonstrated that test-time compute scaling transfers to multimodal generation, achieving +225% improvement on multi-turn editing at 2.5x lower cost than parallel sampling
M3-ACE (M3-ACE, 2026) decoupled perception from reasoning using multiple heterogeneous agents, establishing 89.1% SOTA on MathVision competition-level problems
HouseMind (Tokenization Allows MLLMs to Understand,..., 2026) unified spatial understanding and generation through room-instance tokenization, reducing FID from 11.3 to 1.9 on layout generation

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Process Reward Supervision	Monte Carlo Tree Search (MCTS) generates step-level correctness labels to train Process Reward Models (PRMs) that detect reasoning errors at each intermediate step.	Improves on GPT-4o by +2.7% average across 6 multimodal math benchmarks, with URSA-8B achieving +20.6% absolute gain on MathVista-GPS; MM-PRM adds +10.10% accuracy on out-of-distribution OlympiadBench	Unlocking Multimodal Mathematical Reasoning via... (2025), MM-PRM (2025)
Shaped Reward Reinforcement Learning	Continuous reward functions (Gaussian spatial distributions, soft multi-choice rewards, mixed preference objectives) replace sparse binary signals in multimodal reinforcement learning.	Improves on UI-TARS-72B by +24.7 percentage points on ScreenSpot-Pro, with GUI-G2 achieving 47.5% accuracy using a 7B-parameter model against a 72B baseline	Enhancing the Reasoning Ability of... (2024), GUI-G2 (2025), Reinforcing Video Reasoning with Focused... (2025), Q-Insight (2025)
Efficient Chain-of-Thought Compression	Reasoning steps are compressed into hidden 'thinking token' representations or pruned by suppressing reflection keywords, drastically reducing generation without sacrificing accuracy.	Heima reduces generated tokens to 6% of standard CoT volume while maintaining comparable zero-shot accuracy; NoWait reduces trajectory length by 27–51% with +4.25% accuracy on AMC 2023	Efficient Reasoning with Hidden Thinking (2025), Wait, We Don't Need to... (2025)
Agentic Tool-Augmented Reasoning	A two-stage training pipeline (cold-start supervised fine-tuning followed by outcome-driven RL) teaches models when and how to invoke tools and coordinate with other agents during reasoning.	Improves on Qwen3.5 by +10.2 percentage points on MathVision, with M3-ACE achieving 89.1% state-of-the-art accuracy via multi-agentic perception correction	Look Before You Leap: A... (2025), Simple o3 (2025), DeepEyesV2 (2025), M3-ACE (2026)
Unified Multimodal Test-Time Scaling	Budget forcing at inference compels the model to continue iterative verify-refine loops, generalizing from short training chains to arbitrarily longer inference chains.	Improves over single-pass baselines by +53.33% on MIRA out-of-distribution visual reasoning and +225.19% on ImgEdit multi-turn editing, matching best-of-N sampling at 2.5x lower cost	UniT (2026)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MathVista	Accuracy (%)	67.0%	Enhancing the Reasoning Ability of... (2024)
ScreenSpot-Pro	Accuracy (%)	47.5%	GUI-G2 (2025)
MathVision	Accuracy (%)	89.1%	M3-ACE (2026)
MIRA	Accuracy (% relative improvement)	+53.33% improvement over single-pass baseline	UniT (2026)
CLEVRER	Accuracy (%)	50.4%	Reinforcing Video Reasoning with Focused... (2025)

⚠️ Known Limitations (4)

Visual perception remains the primary bottleneck—models misinterpret diagrams in over 50% of error cases, yet most methods assume reasonably accurate initial perception (affects: Process Reward Supervision, Shaped Reward Reinforcement Learning, Unified Multimodal Test-Time Scaling)
Potential fix: Multi-agentic perception correction (M3-ACE) and iterative visual tool use (Simple o3) decouple perception from reasoning to mitigate this bottleneck
Computational overhead from multi-step reasoning, tool invocation, and iterative refinement significantly increases inference latency, making real-time applications challenging (affects: Agentic Tool-Augmented Reasoning, Unified Multimodal Test-Time Scaling, Process Reward Supervision)
Potential fix: Latent-space reasoning (Heima) and keyword suppression (NoWait) reduce token generation by roughly 27–94%, partially offsetting the overhead of complex reasoning pipelines
Reward hacking and length bias in RL-trained models can produce degenerate reasoning patterns that game reward signals without improving genuine understanding (affects: Shaped Reward Reinforcement Learning, Process Reward Supervision)
Potential fix: PS-GRPO uses 'drop-moment' detection to penalize correct outcomes achieved through flawed reasoning; TW-GRPO applies entropy-based token weighting to focus learning on informative tokens
Multimodal RAG systems are vulnerable to knowledge poisoning attacks—a single adversarial image can reduce accuracy to 0% across all queries via globalized poisoning (affects: Agentic Tool-Augmented Reasoning)
Potential fix: Robust retrieval mechanisms and adversarial filtering of knowledge base entries are needed, though comprehensive defenses against multimodal poisoning remain an open research problem

📚 View major papers in this topic (10)

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling (2026-02) 9
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents (2025-08) 9
M3-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering (2026-03) 8
GUI-G2: Gaussian Reward Modeling for GUI Grounding (2025-07) 8
Unlocking Multimodal Mathematical Reasoning via Process Reward Model (2025-01) 8
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision (2025-05) 8
Simple o3: Towards Interleaved Vision-Language Reasoning (2025-08) 8
Efficient Reasoning with Hidden Thinking (2025-01) 8
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (2024-11) 8
MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification (2024-04) 8

💡 Diving deeper into Multimodal Reasoning, let's examine specific research threads that define this area.

✍️

Visual and Spatial Reasoning

What: Research on enabling multimodal models to perform multi-step logical inference over visual inputs, integrating perception with reasoning for complex visual problem-solving.

Why: Bridging the gap between human-like visual cognition and current models is essential for trustworthy AI in real-world physical and scientific domains.

Baseline: Standard multimodal LLMs encode images once as static features and generate text answers via pattern matching without intermediate reasoning steps.

Models rely on language shortcuts rather than genuine visual understanding, producing correct answers from hallucinated reasoning
Reinforcement learning for multimodal reasoning suffers from reward sparsity, entropy collapse, and gradient vanishing on hard problems
Spatial and 3D reasoning remains fundamentally weak, with top models barely exceeding random chance on abstract visual logic tasks
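
The reward-sparsity and gradient-vanishing problem in the second challenge can be seen directly in GRPO-style group-relative advantages, where each sampled response's reward is normalized against the other samples for the same prompt. The sketch below is a simplified illustration: it uses the population standard deviation and a small `eps`, both implementation choices that vary between codebases.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hard problem: every sampled answer is wrong, so all rewards are 0.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

# Mixed outcomes: the single correct sample gets a positive advantage.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

When every sample in a group receives the same binary reward, all advantages are zero and the prompt contributes no gradient. This is why hard problems, where the model never succeeds, teach it nothing under sparse rewards.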

🧪 Running Example

❓ Given a photograph of a cluttered kitchen, determine which object is closest to the camera and estimate the distance to the refrigerator in meters.

Baseline: A standard MLLM would encode the image once and attempt to answer directly in text, likely hallucinating distances or confusing depth relationships because it lacks spatial grounding and intermediate reasoning steps.

Challenge: This example requires (1) genuine visual perception of depth and spatial layout, not just object recognition, (2) multi-step reasoning combining relative positions, scale estimation, and 3D understanding, and (3) grounding the answer in specific image regions rather than guessing from language priors.

✅ RLVR-Enhanced Multimodal Reasoning: Trains the model via reinforcement learning with verifiable spatial rewards (e.g., distance accuracy), incentivizing genuine visual reasoning over shortcut answers.

✅ Visual Chain-of-Thought Reasoning: The model generates intermediate visual tokens or auxiliary images (e.g., a depth map or annotated layout) as reasoning steps, enabling iterative spatial analysis before answering.

✅ Tool-Augmented Visual Reasoning: The model invokes depth estimation and object detection tools, overlays results on the image, and reasons over the tool outputs to compute distances rather than guessing.

✅ Allocentric Spatial Reasoning: World2Mind constructs a top-down allocentric map from the egocentric view, enabling the model to reason about absolute distances and spatial relationships from a global perspective.
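
The tool-augmented route above can be sketched in a few lines. This is a minimal illustration, assuming a depth-estimation tool has already returned a per-pixel depth map in meters and a detector has returned labeled bounding boxes. The function names, the box format, and the toy values are hypothetical and are not taken from any of the cited systems.

```python
from statistics import median

def median_depth(depth_map, box):
    """Median depth (meters) inside a box (x1, y1, x2, y2); x2 and y2 are exclusive."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)

def nearest_object(depth_map, detections):
    """Return (label, depth) for the detection with the smallest median depth."""
    depths = {label: median_depth(depth_map, box) for label, box in detections.items()}
    label = min(depths, key=depths.get)
    return label, depths[label]

# Toy 4x6 depth map in meters; stands in for the output of a depth-estimation tool.
depth_map = [
    [3.9, 4.0, 4.1, 1.2, 1.1, 1.2],
    [4.0, 4.0, 4.2, 1.1, 1.0, 1.1],
    [4.1, 4.1, 4.0, 0.6, 0.5, 0.6],
    [4.0, 4.2, 4.1, 0.5, 0.5, 0.6],
]
# Hypothetical detector output: label -> bounding box in pixel coordinates.
detections = {
    "refrigerator": (0, 0, 3, 4),
    "kettle": (3, 0, 6, 2),
    "mug": (3, 2, 6, 4),
}

closest, closest_depth = nearest_object(depth_map, detections)
fridge_depth = median_depth(depth_map, detections["refrigerator"])
print(f"closest: {closest} at {closest_depth:.2f} m; refrigerator at {fridge_depth:.2f} m")
```

Taking the median rather than the mean inside each box is a design choice in this sketch: it keeps a few background pixels inside a loose box from skewing the estimate.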

📈 Overall Progress

The field evolved from modular pipelines composing LLMs with vision experts (2023) through a massive RL revolution driven by GRPO and its variants (2025), to sophisticated methods addressing spatial intelligence and agentic tool use (2026). A key paradigm shift was recognizing that supervised fine-tuning primarily teaches format while RL teaches transferable reasoning. Process reward models and visual chain-of-thought have emerged as complementary advances, enabling both training-time and test-time scaling for multimodal reasoning.

📂 Sub-topics

RL-Based Multimodal Reasoning Optimization

38 papers

Applying reinforcement learning—primarily GRPO and its variants—to enhance multimodal models' reasoning capabilities through verifiable rewards, addressing gradient vanishing, reward sparsity, and training instability.

GRPO variants, Variance-Aware Sampling, Cold Start + RL, Dense reward shaping

Visual Chain-of-Thought Methods

14 papers

Extending chain-of-thought reasoning beyond text into the visual domain by interleaving generated images, latent visual tokens, or auxiliary diagrams as intermediate reasoning steps.

Multimodal Visualization-of-Thought, MINT token selection, Latent visual reasoning, Visual sketchpad

Spatial and 3D Visual Reasoning

12 papers

Enabling models to understand and reason about 3D spatial relationships, object orientations, distances, and dynamic spatial interactions from visual inputs including images and videos.

Spatial GRPO, Allocentric spatial trees, Semantic orientation, Map imagination

Tool-Augmented Visual Reasoning

10 papers

Enhancing multimodal models with external vision tools (detectors, depth estimators, code interpreters) and training them via RL to adaptively select and compose tools for complex visual tasks.

ReVPT, AdaReasoner, VisualSketchpad, Dynamic API synthesis

Benchmarks and Evaluation Frameworks

11 papers

Datasets and evaluation protocols that measure genuine visual reasoning capabilities, exposing gaps between model performance and human cognition across abstract reasoning, spatial understanding, and multimodal integration.

Abstract visual IQ tests, Vision-text consistency metrics, Situated reasoning evaluation

💡 Key Insights

💡 RL generalizes while SFT memorizes—reinforcement learning teaches transferable visual reasoning principles

💡 Text-only cold start surprisingly outperforms multimodal data for initializing visual reasoning capabilities

💡 Best models achieve near-random accuracy on abstract visual logic, revealing a massive human-AI gap

📅 Timeline

Research rapidly converged on RLVR as the dominant paradigm after DeepSeek-R1, with 2025 seeing an explosion of GRPO variants addressing multimodal-specific challenges (reward sparsity, text bias, entropy collapse), while 2026 shifts toward spatial intelligence, embodied reasoning, and adaptive tool orchestration.

2023-03 to 2024-06 Foundation models, pioneering multimodal reasoning architectures, and early visual CoT

(MM-REACT, 2023) pioneered composing ChatGPT with vision experts via textual prompts
(Visual Instruction Tuning, 2023) established the visual instruction tuning paradigm, connecting CLIP to Vicuna via simple linear projection
(T-SciQ, 2023) achieved 96.18% on ScienceQA using mixed LLM-generated CoT signals, surpassing human performance
(GoT, 2023) modeled non-linear reasoning as graph structures rather than sequential chains
GPT-4V exploration (The Dawn of LMMs, 2023) systematically documented LMM capabilities including visual referring prompting
(VisualSketchpad, 2024) enabled models to draw on images as visual reasoning steps, setting SOTA on V*Bench

🔀 Transition from task-specific vision models to general-purpose multimodal LLMs that combine visual perception with language reasoning via instruction tuning.

2024-07 to 2025-06 RL revolution for multimodal reasoning and challenging benchmark creation

Kimi k1.5 (Kimi k1.5, 2025) matched OpenAI o1 using long-context RL with partial rollouts, without Monte Carlo Tree Search
Vision-R1 (Vision-R1, 2025) introduced modality bridging and progressive thinking suppression for stable multimodal RL training
(VisualPRM, 2025) built the first large-scale multimodal process reward model enabling fine-grained step-level supervision
Cold Start study (Advancing Multimodal Reasoning via RL..., 2025) demonstrated that SFT initialization is critical, achieving 73.4% on MathVista and surpassing GPT-4o
MVoT (Imagine while Reasoning in Space, 2025) introduced visual token generation during reasoning, outperforming text CoT by +20% on spatial tasks
EMMA benchmark (Can MLLMs Reason in Multimodality?, 2025) exposed that most 'multimodal' questions can be solved without images, filtering to truly visual tasks

🔀 Shift from supervised fine-tuning to reinforcement learning with verifiable rewards (RLVR) as the dominant training paradigm for multimodal reasoning, catalyzed by DeepSeek-R1's success.

2025-07 to 2026-03 Advanced RL optimization, spatial intelligence, and agentic tool use

MMR1 (MMR1, 2025) solved GRPO gradient vanishing via Variance-Aware Sampling, achieving SOTA 58.4 across multimodal reasoning benchmarks
World2Mind (World2Mind, 2026) introduced training-free allocentric spatial reasoning with +17.6% on VSI-Bench
(AdaReasoner, 2026) achieved 97.6% on spatial planning via RL-trained adaptive tool orchestration
(Anchor-Token, 2026) identified that only ~15% of tokens are visually grounded perceptual anchors, enabling targeted reward allocation
Compositional Visual Reasoning Survey (Explain Before You Answer, 2025) synthesized 260+ papers into a five-stage evolutionary roadmap for visual reasoning

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
RLVR-Enhanced Multimodal Reasoning	Group Relative Policy Optimization (GRPO) uses group-based reward normalization to optimize reasoning without critic models, extended with multimodal-specific stabilization techniques.	MMR1-7B achieves 58.4 avg across 5 benchmarks, surpassing R1-VL-7B (47.7) by +10.7 points; Vision-R1-7B reaches 73.5% on MathVista, near OpenAI o1's 73.9%	Kimi k1.5 (2025), MMR1 (2025), Advancing Multimodal Reasoning via Reinforcement... (2025), GPG (2025), Stable and Efficient Single-Rollout RL... (2025)
Visual Chain-of-Thought Reasoning	Models generate interleaved visual artifacts (images, crops, latent embeddings) as intermediate reasoning steps, bridging the semantic gap between perception and language.	MVoT outperforms text CoT by +20% on complex spatial tasks (FrozenLake 85.6% vs CoT 39.1%); MINT-CoT-7B improves +34.08% on MathVista over baseline	Imagine while Reasoning in Space:... (2025), MINT-CoT (2025), Monet (2025), MathCanvas (2025)
Multimodal Process Reward Models	Process Reward Models (PRMs) score individual reasoning steps via Monte Carlo estimation or consistency filtering, providing dense supervision beyond binary final-answer rewards.	VisualPRM-8B improves InternVL2.5-78B by +5.9 points across 7 benchmarks; Athena-PRM achieves 83.1 F1 on VisualProcessBench, outperforming prior best by +3.9; DreamPRM reaches 85.2% on MathVista leaderboard	VisualPRM (2025), Athena (2025), DreamPRM (2025), AutoRubric-R1V (2025)
Tool-Augmented Visual Reasoning	Models learn when and how to invoke external vision tools through reinforcement learning, treating tool selection as a trainable reasoning skill rather than static supervised behavior.	ReVPT-7B improves +9.82% on CV-Bench over Qwen2.5-VL-7B; AdaReasoner-7B surpasses GPT-5 on spatial planning (96.6% vs 80.1%); VisualSketchpad boosts GPT-4o by +12.7% on math tasks	MM-REACT (2023), VisualSketchpad (2024), Reinforced Visual Perception with Tools (2025), AdaReasoner (2026)
Allocentric Spatial and Embodied Reasoning	Converting egocentric visual observations into global allocentric representations (spatial trees, semantic orientations, grid maps) enables reasoning about absolute positions and 3D relationships.	World2Mind improves Claude-4.6-Opus by +17.6% on VSI-Bench (38.4%→56.0%); SpaceR achieves 45.6% on VSI-Bench, surpassing GPT-4o by +11.6%; vsGRPO-2B outperforms GPT-4o on visual-spatial tasks	World2Mind (2026), M2-Reasoning (2025), SoFar (2025), Embodied-Reasoner (2025)
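
The group-based reward normalization mentioned for GRPO in the table above can be illustrated with a short sketch. It assumes binary verifiable rewards and uses the population standard deviation with a small `eps`; both are choices made for this illustration, and the function name is hypothetical. The second call shows why a group in which every sampled response fails contributes no learning signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to the same prompt, scored 1.0 if verified correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# On a hard prompt every response fails, so every advantage is zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:", mixed)
print("all-fail group:", all_fail)
```

Because the baseline is the group mean, no separate critic model is needed. The same property produces zero advantages whenever all rewards in a group are identical.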

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MathVista	Accuracy (%)	85.2%	DreamPRM (2025)
VSI-Bench	Average Accuracy (%)	56.0%	World2Mind (2026)
ScienceQA	Accuracy (%)	96.18%	T-SciQ (2023)
CV-Bench (Spatial Reasoning)	Average Accuracy (%)	82.3%	M2-Reasoning (2025)
VisuLogic	Accuracy (%)	31.1%	VisuLogic (2025)

⚠️ Known Limitations (4)

Text-bias and language shortcuts: models frequently arrive at correct answers by exploiting textual patterns rather than processing visual information, producing 'right answers for wrong reasons' (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning)
Potential fix: Text-bias calibration by subtracting text-only predictions from multimodal predictions; visual perception rewards that verify grounding; answer-grounding consistency metrics
Entropy collapse and reward sparsity: GRPO-based training frequently leads to premature convergence where models stop exploring, especially on hard problems where all sampled responses fail (affects: RLVR-Enhanced Multimodal Reasoning)
Potential fix: Variance-aware sampling to select prompts with mixed outcomes; latent spectral dispersion regularization; hint-guided training that provides partial solutions to unlock gradient signals on hard problems
Fundamental visual perception failures: 72-78% of reasoning errors stem from incorrect visual perception rather than flawed logic, and models perform worse on images than equivalent text descriptions (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning, Allocentric Spatial and Embodied Reasoning)
Potential fix: Visual-text self-distillation to close the modality gap; dedicated visual perception reward signals during RL training; tool augmentation to offload fine-grained perception to specialist models
Scalability to small models: most advances target 7B+ parameter models, while compact models (<4B) struggle with complex multimodal reasoning and are under-explored (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning)
Potential fix: Two-stage text-first RL bootstrapping before multimodal transfer; relaxed on-policy distillation from larger teachers; no-thinking and adaptive-thinking strategies that reduce computational overhead for simple tasks
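
As a rough illustration of selecting prompts with mixed outcomes, the sketch below ranks prompts by the Bernoulli variance of their rollout results. It is a simplified stand-in and not the Variance-Aware Sampling procedure of any cited paper; the function names and the toy rollout data are hypothetical.

```python
def outcome_variance(outcomes):
    """Bernoulli variance p * (1 - p) of binary rollout outcomes for one prompt."""
    p = sum(outcomes) / len(outcomes)
    return p * (1.0 - p)

def select_prompts(rollouts, k):
    """Keep the k prompts whose rollouts disagree the most (ties broken by prompt id)."""
    ranked = sorted(rollouts, key=lambda pid: (-outcome_variance(rollouts[pid]), pid))
    return ranked[:k]

# Hypothetical rollout outcomes per prompt: 1 = verified correct, 0 = incorrect.
rollouts = {
    "easy": [1, 1, 1, 1],
    "hard": [0, 0, 0, 0],
    "medium": [1, 0, 1, 0],
    "borderline": [0, 0, 0, 1],
}

selected = select_prompts(rollouts, k=2)
print(selected)
```

Prompts that are always solved or never solved have zero variance and are dropped first, which matches the intuition that they provide no gradient signal under group-normalized rewards.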

📚 View major papers in this topic (10)

Visual Instruction Tuning (2023-04) 9
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023-09) 9
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025-01) 9
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering (2023-05) 9
World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models (2026-03) 9
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers (2025-06) 9
Explain Before You Answer: A Survey on Compositional Visual Reasoning (2025-08) 9
QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training (2025-05) 9
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation (2025-05) 9
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought (2025-01) 8

💡 Within the same paradigm, another important research direction focuses on Hallucination Mitigation.

🔗

Hallucination Mitigation

What: Research on detecting, measuring, and reducing hallucinations in Multimodal Large Language Models — outputs that contradict visual evidence or factual knowledge.

Why: Hallucinations undermine MLLM reliability in safety-critical applications like medical imaging, autonomous navigation, and visual assistants.

Baseline: Standard MLLMs generate text auto-regressively from visual features, relying on language priors without explicit grounding or verification mechanisms.

Models over-rely on language priors and statistical co-occurrence rather than grounding responses in actual visual content
Preference optimization often overfits to easy examples while failing on nuanced hallucination cases
No unified benchmark covers all hallucination types across faithfulness, factuality, and reasoning dimensions

🧪 Running Example

❓ Given an image of a park with two brown dogs near a red bench: 'How many dogs are in this image, and what color is the bench?'

Baseline: A standard MLLM might respond 'There are three dogs near a blue bench in a park with a fountain,' hallucinating an extra dog, the wrong color, and a non-existent fountain due to language priors about typical park scenes.

Challenge: This illustrates object hallucination (inventing a third dog), attribute hallucination (wrong bench color), and extrinsic hallucination (fabricating a fountain) — showing how models rely on statistical co-occurrence patterns rather than visual grounding.

✅ Preference-Based Alignment: RLHF-V collects segment-level corrections (changing 'three dogs' to 'two dogs' and 'blue' to 'red'), then uses Dense DPO to penalize the exact hallucinated segments while preserving correct parts.

✅ Training-Free Decoding Intervention: OPERA detects that the model is over-attending to summary tokens (punctuation marks) rather than visual features, penalizes this pattern in beam search, and forces re-examination of the image tokens.

✅ Grounded Chain-of-Thought Reasoning: GCoT requires the model to first locate each dog with bounding boxes [x1,y1,x2,y2] before counting, catching the miscount through explicit spatial grounding of each entity.

✅ Robust Data Curation & Unlearning: LRV-Instruction includes negative examples that ask about non-existent objects (e.g., 'Is there a fountain?'), teaching the model to confidently answer 'No' when visual evidence is absent.
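
The grounded counting idea in the running example can be made concrete with a small consistency check. This is a minimal sketch, assuming the model emits one [x1,y1,x2,y2] box per entity it claims to see; the helper names, the IoU threshold, and the box values are hypothetical and do not reproduce GCoT's actual pipeline or metric.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def distinct_boxes(boxes, threshold=0.5):
    """Drop boxes that overlap an already kept box at or above the IoU threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, other) < threshold for other in kept):
            kept.append(box)
    return kept

def count_is_grounded(claimed_count, boxes):
    """True when the stated count matches the number of distinct grounded boxes."""
    return claimed_count == len(distinct_boxes(boxes))

# The third box nearly duplicates the first, so only two dogs are actually grounded.
dog_boxes = [(10, 40, 60, 90), (80, 45, 130, 95), (12, 42, 61, 91)]

print(count_is_grounded(3, dog_boxes))
print(count_is_grounded(2, dog_boxes))
```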

📈 Overall Progress

The field has progressed from surface-level output correction (2023) through self-improvement and data-centric methods (2024) to process-aware evaluation and cross-modal robustness (2025–2026). A major paradigm shift occurred with the realization that correct final answers often mask severe reasoning hallucinations — shifting evaluation focus from outcomes to intermediate thinking. Concurrently, methods evolved from requiring costly human feedback to fully self-supervised approaches using self-generated preference pairs and training-free inference interventions.

📂 Sub-topics

Preference-Based Alignment

7 papers

Methods that use human or automated feedback with preference optimization (RLHF, DPO, GRPO) to align MLLM outputs with visual ground truth and reduce hallucinations through reward shaping.

Dense Direct Preference Optimization, Factually Augmented RLHF, Difficulty-Aware DPO, Modality-Decoupled DPO

Decoding & Inference-Time Mitigation

4 papers

Training-free methods that modify the decoding process or apply inference-time interventions (attention penalties, steering vectors, tool-based verification) to suppress hallucinations without retraining.

Over-trust Penalty Decoding, Flexible Association Control, Program-of-Thought Claim Verification

Grounded Reasoning & Chain-of-Thought

5 papers

Approaches that incorporate explicit visual grounding (bounding boxes, spatial coordinates) into chain-of-thought reasoning to ensure models justify answers with verifiable visual evidence.

Self-Feedback Guided Revision, Grounded Chain-of-Thought, Planning-Action-Summarization CoT, Visual-Spatial Reasoning

Benchmarks & Evaluation Frameworks

7 papers

Diagnostic benchmarks and evaluation methodologies that systematically measure different hallucination types (existence, attribute, relation, faithfulness, factuality) across diverse tasks and contexts.

Adversarial VH Generation, LLM-free Multi-dimensional Evaluation, Long-Context MCQ Evaluation, Process-Aware Thinking Evaluation

Data Curation & Instruction Tuning

5 papers

Methods that improve hallucination robustness through better training data design — including negative instruction examples, diverse high-quality datasets, targeted unlearning, and parameter-efficient strategies.

Negative Instruction Tuning, Semi-Automatic Instruction Generation, Sharpness-Aware Robust Erasure, PEFT Benchmarking

Domain-Specific Applications

4 papers

Hallucination mitigation techniques tailored to specific domains such as medical report generation, agriculture, and energy forecasting, where factual accuracy is critical.

Adaptive Fusion for Radiology, Visual Fact-Guided Generation, Vision-RAG Pipeline

Surveys & Theoretical Frameworks

4 papers

Comprehensive surveys of AGI hallucination across modalities and theoretical frameworks that formalize hallucination measurement using information geometry and cognitive science.

AGI Hallucination Taxonomy, Spectral Information-Geometric Framework, Cognitive Friction Metric

💡 Key Insights

💡 Segment-level human corrections reduce hallucinations 7× more efficiently than whole-response ranking

💡 Training-free attention penalties and steering vectors reduce hallucinations without model retraining

💡 Correct final answers frequently mask severe hallucinations in intermediate reasoning steps

💡 Larger models paradoxically ground worse — 72B models show lower consistency than 7B counterparts

📅 Timeline

Research has moved from reactive post-hoc correction toward proactive grounding — embedding spatial verification directly into reasoning chains — while simultaneously expanding evaluation from simple object existence checks to multi-dimensional, process-aware, cross-modal benchmarks.

2023-06 to 2023-12 Foundation — establishing the hallucination problem and first-generation mitigations

LRV-Instruction (Mitigating Hallucination in Large Multi-Modal..., 2023) introduced the first large-scale negative instruction dataset with 400k examples covering 16 tasks
Fact-RLHF (Aligning Large Multimodal Models with..., 2023) pioneered factually augmented reward models with 'cheat sheets' and created MMHal-Bench
(OPERA, 2023) discovered columnar attention patterns as the root cause and introduced training-free decoding intervention
(AMBER, 2023) established LLM-free, reproducible hallucination evaluation across generative and discriminative tasks
(RLHF-V, 2023) demonstrated that segment-level DPO reduces hallucinations by 34.8% with 7× less data than response-level RLHF
(Volcano, 2023) introduced single-model critique-revise-decide loops for self-correction

🔀 Shift from treating hallucination as a secondary failure mode to a primary research target, with dedicated datasets, benchmarks, and alignment methods.

2024-01 to 2024-12 Diversification — expanding evaluation dimensions, self-improvement, and domain applications

VHTest (Visual Hallucinations of Multi-modal Large..., 2024) used CLIP/DINO discrepancy to adversarially generate diverse hallucination instances across 8 modes
(Cantor, 2024) replaced fragmented external tools with MLLM-as-experts via prompted role-playing
SIMA (Enhancing Visual-Language Modality Alignment via Self-Improvement, 2024) demonstrated hallucination reduction without any external models using self-generated preference pairs
(MMInstruct, 2024) built a semi-automatic data engine achieving SOTA on 10 of 12 benchmarks
(Pelican, 2024) introduced computational graph verification reducing hallucinations by 27% over Woodpecker
LLM-RG4 (LLM-RG4, 2024) applied adaptive token fusion and loss weighting to eliminate input-agnostic hallucinations in medical reports

2025-01 to 2026-03 Maturation — deeper grounding, process-aware evaluation, and cross-modal robustness

GCoT (Grounded Chain-of-Thought for MLLMs, 2025) revealed an inverse scaling phenomenon where larger models ground worse, with 72B models showing only 11.1% consistency despite 75.7% accuracy
(Rex-Thinker, 2025) combined structured planning-action-summarization CoT with GRPO reinforcement for 86.8% rejection accuracy
(FlexAC, 2025) discovered middle-layer steering vectors enabling dynamic faithfulness-creativity control at inference time
(MM-THEBench, 2026) exposed that top models achieve 70.6% answer accuracy but only 22.8% thinking correctness
(Modality-Decoupled, 2026) introduced modality-aware invariance and sensitivity regularization achieving +27% on cross-modal hallucination tasks
(Sharpness-Aware, 2026) formulated unlearning as a min-max game, making hallucination erasure robust against fine-tuning perturbations
(INFACT, 2026) revealed that most video-LLMs have near-zero temporal sensitivity, relying on static cues rather than temporal understanding

🔀 Shift from output-level hallucination detection to process-level reasoning evaluation, revealing that correct answers often mask hallucinated intermediate reasoning steps.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Preference-Based Alignment	Align model outputs with visual ground truth via Direct Preference Optimization (DPO) or Reinforcement Learning from Human Feedback (RLHF) on carefully constructed preference pairs.	RLHF-V reduces hallucination rate by 34.8% over the base MLLM using only 1.4k samples, outperforming LLaVA-RLHF which required 10k samples. MoD-DPO achieves +27% accuracy on AVHBench over vanilla DPO.	RLHF-V (2023), Aligning Large Multimodal Models with... (2023), Modality-Decoupled (2026), DA-DPO (2026), Enhancing Visual-Language Modality Alignment in... (2024)
Training-Free Decoding Intervention	Detect and intervene on hallucination-prone attention or activation patterns during decoding, requiring zero additional training or data.	OPERA achieves up to +35.8% improvement on the CHAIR hallucination metric over standard beam search across InstructBLIP, MiniGPT-4, LLaVA, and Shikra. FlexAC reduces CHAIR hallucination by 29% while boosting creativity 5.8× on Creation-MMBench.	OPERA (2023), FlexAC (2025), Pelican (2024)
Grounded Chain-of-Thought Reasoning	Embed explicit spatial grounding (bounding boxes, visual prompts) into chain-of-thought reasoning to ensure each reasoning step is tied to verifiable visual evidence.	GCoT achieves +55.7% improvement in Answer-Grounding Consistency over baseline LLaVA-7B (from ~11% to ~67%). Volcano achieves +24.9% on hallucination benchmarks over prior methods like Woodpecker and LLaVA-RLHF.	Grounded Chain-of-Thought for Multimodal Large... (2025), Volcano (2023), Rex-Thinker (2025), Cantor (2024)
Robust Data Curation & Unlearning	Curate balanced training data with explicit negative examples and diverse instructions, or erase hallucination patterns through robust unlearning that survives fine-tuning perturbations.	LRV-Instruction improves POPE accuracy by +28.2 points (from 56.8 to 85.0) on MiniGPT4. MMInstruct achieves 1626.2 on MME, surpassing LLaVA-1.5 baseline by +94.9 points. SARE reduces Chair_S from 69.6 to 37.3, outperforming EFUF baseline (43.6).	Mitigating Hallucination in Large Multi-Modal... (2023), MMInstruct (2024), Beyond Superficial Unlearning (2026), An Empirical Study on Parameter-Efficient... (2024)
Comprehensive Hallucination Evaluation	Construct diverse, multi-dimensional hallucination benchmarks with fine-grained taxonomies and reproducible automated metrics to replace costly human or GPT-4 evaluation.	VHTest reveals GPT-4V achieves only 38.3% accuracy on adversarially generated hallucination instances. MM-THEBench shows Qwen3-VL-235B has 70.6% answer accuracy but only 22.8% thinking correctness, exposing hidden reasoning hallucinations.	AMBER (2023), Visual Hallucinations of Multi-modal Large... (2024), MM-THEBench (2026), INFACT (2026), LongHalQA (2024)
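
For readers unfamiliar with the preference objective named in the table above, the standard response-level DPO loss for a single preference pair can be written in a few lines. This sketch shows vanilla DPO only, not the segment-level or modality-decoupled variants. The log-probabilities are made-up sequence-level values, and `beta=0.1` is an arbitrary choice for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from policy and reference log-probs."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); fine for moderate margins in a sketch.
    return math.log1p(math.exp(-beta * margin))

# Chosen: "two dogs near a red bench"; rejected: "three dogs near a blue bench".
aligned = dpo_loss(-12.0, -15.0, -14.0, -14.0)      # policy favors the chosen text
misaligned = dpo_loss(-15.0, -12.0, -14.0, -14.0)   # policy favors the rejected text
neutral = dpo_loss(-14.0, -14.0, -14.0, -14.0)      # policy equals the reference

print(aligned, neutral, misaligned)
```

The loss falls below log 2 only when the policy shifts probability toward the chosen response relative to the reference model, which is what penalizes hallucinated responses in the rejected slot.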

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
POPE (Polling-based Object Probing Evaluation)	Accuracy	85.0% accuracy (Random split)	Mitigating Hallucination in Large Multi-Modal... (2023)
CHAIR (Caption Hallucination Assessment with Image Relevance)	CHAIR_S (sentence-level hallucination rate, lower is better)	37.3 CHAIR_S on mPLUG-Owl (reduced from 69.6 vanilla)	Beyond Superficial Unlearning (2026)
MMHal-Bench	Overall Score	60% relative improvement over baselines	Aligning Large Multimodal Models with... (2023)
Answer-Grounding Consistency (GCoT Metric)	Consistency Rate	+55.7% improvement over baseline LLaVA-7B	Grounded Chain-of-Thought for Multimodal Large... (2025)
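
To clarify what the CHAIR row measures, the sketch below computes the sentence-level rate (CHAIR_S, the share of captions containing at least one hallucinated object) and the instance-level rate (CHAIR_I, the share of mentioned objects that are hallucinated). It assumes object mentions have already been extracted from each caption and mapped to a fixed vocabulary; the function name and the toy captions are hypothetical. Results in the table are reported on a 0 to 100 scale, while this sketch returns fractions.

```python
def chair_scores(samples):
    """samples: list of (mentioned_objects, ground_truth_objects), one pair per caption."""
    hallucinated_mentions = 0
    total_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in samples:
        wrong = [obj for obj in mentioned if obj not in truth]
        hallucinated_mentions += len(wrong)
        total_mentions += len(mentioned)
        captions_with_hallucination += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(samples) if samples else 0.0
    return chair_s, chair_i

samples = [
    # The caption mentions a fountain that is not in the image annotations.
    (["dog", "bench", "fountain"], {"dog", "bench", "tree"}),
    # The caption mentions only objects that are present.
    (["dog", "bench"], {"dog", "bench"}),
]

chair_s, chair_i = chair_scores(samples)
print(f"CHAIR_S = {chair_s:.2f}, CHAIR_I = {chair_i:.2f}")
```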

⚠️ Known Limitations (4)

Most methods are validated only on 7B-13B parameter models, leaving uncertainty about whether findings scale to frontier-class MLLMs (70B+) or exhibit inverse scaling effects (affects: Preference-Based Alignment, Robust Data Curation & Unlearning, Grounded Chain-of-Thought Reasoning)
Potential fix: GCoT's inverse scaling finding suggests that grounding-aware training specifically needs to be applied at scale; larger models may require proportionally more grounding data or stronger architectural constraints.
Evaluation is fragmented across incompatible benchmarks (POPE, CHAIR, MMHal-Bench, AMBER) with different metrics, making cross-method comparison unreliable and potentially misleading (affects: Comprehensive Hallucination Evaluation, Preference-Based Alignment, Training-Free Decoding Intervention)
Potential fix: Unified multi-dimensional benchmarks like AMBER and MM-THEBench are moving toward standardization, but the community needs consensus on a common evaluation protocol covering faithfulness, factuality, and reasoning dimensions.
Methods that reduce hallucinations often suppress creative and associative reasoning, creating a faithfulness-creativity trade-off that limits MLLM applicability in open-ended tasks (affects: Preference-Based Alignment, Training-Free Decoding Intervention)
Potential fix: FlexAC demonstrates that steering vectors can dynamically adjust the faithfulness-creativity balance at inference time, suggesting adaptive control mechanisms as a promising direction.
Standard DPO-based unlearning achieves only superficial suppression — hallucinations catastrophically resurge after lightweight fine-tuning or parameter perturbation (affects: Preference-Based Alignment, Robust Data Curation & Unlearning)
Potential fix: SARE's min-max formulation with sharpness-aware optimization demonstrates that flattening the loss landscape around unlearned states can make erasure robust to fine-tuning perturbations.

📚 View major papers in this topic (10)

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback (2023-12) 8
Aligning Large Multimodal Models with Factually Augmented RLHF (2023-09) 8
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation (2023-11) 8
Visual Hallucinations of Multi-modal Large Language Models (2024-02) 8
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024-07) 8
Modality-Decoupled Direct Preference Optimization (2026-03) 8
Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs (2026-01) 8
FlexAC: Flexible Association Control for Multimodal Large Language Models (2025-10) 8
Grounded Chain-of-Thought for Multimodal Large Language Models (2025-03) 7
MM-THEBench: Do Reasoning MLLMs Think Reasonably? (2026-01) 8

💡 Within the same paradigm, another important research direction focuses on Multimodal RLHF and Preference Alignment.

⚙️

Multimodal RLHF and Preference Alignment

What: Research on aligning multimodal large language models with human preferences through reward modeling, RLHF, and direct preference optimization across vision-language tasks.

Why: Multimodal models frequently hallucinate, ignore visual details, or produce outputs misaligned with human expectations, undermining trustworthiness and usability.

Baseline: Standard supervised fine-tuning on curated vision-language datasets with single-step, scalar reward signals from human annotations.

Step-level supervision for multimodal reasoning is expensive and difficult to automate reliably
Reward models struggle to generalize across diverse modalities and complex reasoning traces
Cross-modal preference alignment must jointly optimize over visual, textual, and contextual dimensions

🧪 Running Example

❓ Given a photo of a cluttered kitchen, generate concise alt-text that accurately describes the salient objects and their spatial relationships for a screen reader.

Baseline: A standard SFT-trained model generates a verbose, generic caption like 'A kitchen with many items on the counter', which omits specific objects and spatial details and fails to adapt to the surrounding webpage context.

Challenge: This example illustrates all key challenges: (1) the model's reasoning steps (identifying objects, spatial layout) lack intermediate supervision; (2) a single reward score cannot distinguish hallucinated objects from missing details; (3) the alt-text must align across the image content, the generated text, and the surrounding page context simultaneously.

✅ Stepwise Visual Program Reward Modeling: SVIP generates executable code to verify each reasoning step (object detection, spatial checks), providing fine-grained, multi-dimensional reward signals that catch intermediate errors before the final caption.

✅ Cross-Modal Direct Preference Optimization: MCM-DPO constructs preference pairs across image, text, and context dimensions, teaching the model to produce concise alt-text aligned with all three modalities simultaneously.

✅ Scalable Multimodal Reward Models: Skywork-VL Reward provides generalizable reward signals across diverse tasks, enabling the model to improve both caption quality and contextual relevance through preference-based training.
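
The difference between a single scalar reward and step-level, multi-dimensional rewards can be shown with a toy trace. This is an illustrative sketch only: the dimension names, the min aggregation, and the 0.5 threshold are assumptions made here and are not drawn from SVIP or any other cited method.

```python
def step_rewards(step_scores):
    """Collapse each step's per-dimension scores into one reward via the weakest dimension."""
    return [min(scores.values()) for scores in step_scores]

def first_failing_step(step_scores, threshold=0.5):
    """Index of the first step whose reward drops below the threshold, or None."""
    for index, reward in enumerate(step_rewards(step_scores)):
        if reward < threshold:
            return index
    return None

# Hypothetical per-step scores for an alt-text reasoning trace.
trace = [
    {"object_detection": 0.9, "spatial_check": 0.8},  # identify objects
    {"object_detection": 0.9, "spatial_check": 0.2},  # wrong spatial relation
    {"object_detection": 0.7, "spatial_check": 0.9},  # compose the caption
]

rewards = step_rewards(trace)
scalar_reward = sum(rewards) / len(rewards)  # what a single-score signal would see

print("step rewards:", rewards)
print("first failing step:", first_failing_step(trace))
print("scalar reward:", round(scalar_reward, 3))
```

The averaged scalar looks acceptable, while the step-level view pinpoints the spatial error in the second step. That localization is the property step-level supervision is meant to provide.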

📈 Overall Progress

The field has progressed from foundational visual-to-text alignment methods (2021) to unified any-modality frameworks (2023) to sophisticated multimodal preference optimization (2024–2025). A major paradigm shift occurred with the move from single-step, scalar rewards to step-level, multi-dimensional reward signals that verify intermediate multimodal reasoning. The latest work increasingly focuses on reward model robustness, few-shot adaptability, and cross-modal preference dimensions.

📂 Sub-topics

Multimodal Projection and Representation Alignment

3 papers

Methods for projecting diverse modality signals (image, audio, video) into a shared language model embedding space using lightweight adapters, enabling frozen LLMs to process multimodal inputs without modifying their weights.

Visual Prefix Tuning for Frozen LLMs, Unified Any-to-Any Multimodal Alignment

Multimodal Reward Modeling

3 papers

Building reward models that provide fine-grained, multi-dimensional reward signals for multimodal reasoning, including step-level supervision via executable visual programs and few-shot activation steering.

Stepwise Visual Program Reward Modeling, Scalable Multimodal Reward Models

Multimodal Preference Optimization

2 papers

Adapting preference optimization methods such as DPO and RLHF to multimodal settings, incorporating cross-modal preference pairs and negative supervision for vision-language alignment.

Cross-Modal Direct Preference Optimization

💡 Key Insights

💡 Step-level multi-dimensional rewards outperform coarse single-score supervision for multimodal reasoning

💡 Negative supervision from rejected responses captures the core value of multimodal RLHF

💡 Frozen LLM weights preserve factual knowledge better than fine-tuned alternatives in multimodal settings

💡 Few-shot activation steering resists reward hacking more effectively than prompting-based approaches

💡 Cross-modal preference pairs across visual, textual, and contextual dimensions improve alignment robustness


📅 Timeline

Research has evolved from simply projecting visual features into LLM space toward building comprehensive reward and preference systems that capture fine-grained, multi-dimensional alignment across diverse modalities and reasoning steps.

2021-12 to 2021-12 Pioneering visual alignment to frozen language models

Frozen (Multimodal Few-Shot Learning with Frozen..., 2021) introduced visual prefix tuning that treats images as continuous words, achieving multimodal few-shot learning with a completely frozen LLM

🔀 First demonstration that visual inputs can be projected into a frozen LLM's embedding space as continuous tokens, enabling multimodal reasoning without modifying the language model.
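
The idea of treating an image as a few continuous tokens can be sketched as follows. This is a toy illustration under stated assumptions: the dimensions, the random weights, and the helper names `linear` and `visual_prefix` are hypothetical, and a real system would use a trained vision encoder and a frozen LLM's actual embedding table.

```python
import random

def linear(vec, weight):
    """Multiply a feature vector by an (in_dim x out_dim) weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]

def visual_prefix(image_feature, weight, n_prefix, d_model):
    """Map one image feature to n_prefix 'visual word' embeddings of size d_model."""
    flat = linear(image_feature, weight)  # length n_prefix * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prefix)]

random.seed(0)
d_vision, d_model, n_prefix = 8, 4, 2

# Only this projection would be trained; the language model stays frozen.
weight = [[random.gauss(0, 0.02) for _ in range(n_prefix * d_model)]
          for _ in range(d_vision)]
image_feature = [random.random() for _ in range(d_vision)]

# Stand-in for the frozen LLM's embeddings of three text tokens.
text_embeddings = [[0.0] * d_model for _ in range(3)]

prefix = visual_prefix(image_feature, weight, n_prefix, d_model)
llm_input = prefix + text_embeddings  # visual tokens are prepended to the text tokens
```

Because the visual tokens live in the same space as word embeddings, the language model consumes them without any change to its own weights.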

2023-09 to 2023-09 Scaling to any-to-any multimodal alignment across five or more modalities

NExT-GPT (2023) connected frozen encoders and diffusion decoders to an LLM core for end-to-end any-to-any multimodal generation
AnyMAL (Any-Modality, 2023) demonstrated scalable 70B-parameter multimodal alignment using quantized pre-training across five modalities

🔀 Extension from image-only alignment to unified frameworks supporting five-plus modalities (text, image, audio, video, motion) and bidirectional generation.

2024-11 to 2025-10 Multimodal reward modeling and preference optimization with fine-grained supervision

nSFT (Continual SFT Matches Multimodal RLHF..., 2024) revealed that negative supervision is the core value of multimodal RLHF, proposing a simpler SFT-based alternative
SVIP (Benchmarking Multimodal CoT Reward Model..., 2025) automated step-level reward annotation using executable visual programs with three-dimensional quality labels
Skywork-VL Reward (Skywork-VL, 2025) achieved state-of-the-art multimodal reward modeling through dual-source data curation and two-stage training
Activation Reward (Activation Reward Models for Few-Shot..., 2025) introduced few-shot activation steering that surpasses GPT-4o on reward hacking resistance
MCM-DPO (2025) extended DPO with seven-dimensional cross-modal preference pairs for alt-text generation

🔀 Shift from coarse final-answer rewards to fine-grained, step-level, multi-dimensional reward signals and cross-modal preference optimization.
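
The difference between a single outcome reward and step-level, multi-dimensional labels can be sketched as below. The three label names follow the SVIP description (Relevance, Logic, Attribute), but the aggregation rule, the `StepLabel` class, and the example steps are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class StepLabel:
    """Three per-step quality labels for one reasoning step."""
    relevance: bool
    logic: bool
    attribute: bool

    def score(self) -> float:
        # Assumed aggregation: fraction of dimensions the step passes.
        return (self.relevance + self.logic + self.attribute) / 3

def first_faulty_step(steps):
    """Return the index of the first step failing any dimension, or None."""
    for index, step in enumerate(steps):
        if step.score() < 1.0:
            return index
    return None

steps = [
    StepLabel(relevance=True, logic=True, attribute=True),   # e.g. locate the counter
    StepLabel(relevance=True, logic=True, attribute=False),  # e.g. wrong object attribute
    StepLabel(relevance=True, logic=False, attribute=True),  # e.g. invalid inference
]
step_rewards = [step.score() for step in steps]

# A coarse outcome reward only says that something went wrong somewhere.
outcome_reward = 0.0 if first_faulty_step(steps) is not None else 1.0
```

The step-level view pinpoints which step failed and on which dimension, whereas the scalar outcome reward collapses all of that into a single zero.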

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Visual Prefix Tuning for Frozen LLMs	Treat images as continuous 'visual words' aligned to a frozen LLM's embedding space, preserving its reasoning and few-shot abilities.	Improves over fine-tuning baselines by +1.7% on OKVQA zero-shot (5.9% vs 4.2%), demonstrating frozen weights preserve factual knowledge better.	Multimodal Few-Shot Learning with Frozen... (2021)
Unified Any-to-Any Multimodal Alignment	Connect frozen pre-trained encoders and decoders to an LLM core via small projection layers, enabling any-input to any-output multimodal reasoning.	AnyMAL improves +7.0% relative accuracy on VQAv2 zero-shot and +8.4 CIDEr on COCO captioning over prior literature baselines.	NExT-GPT (2023), Any-Modality (2023)
Stepwise Visual Program Reward Modeling	Translate code execution traces into natural-language CoT steps with three-dimensional labels — Relevance, Logic, and Attribute — for fine-grained reward modeling.	Improves +6.3% on SVIP-Test for Qwen2-VL-7B over baseline, and +5.95% average with the SVIP-Reward architecture over standard fine-tuning.	Benchmarking Multimodal CoT Reward Model... (2025)
Scalable Multimodal Reward Models	Combine dual-source preference data from standard VLMs and advanced reasoners with two-stage training, or use activation steering for few-shot reward adaptation.	Skywork-VL achieves state-of-the-art on VL-RewardBench among open-source models; Activation Reward surpasses GPT-4o on the PreferenceHack benchmark for robustness.	Skywork-VL Reward (2025), Activation Reward Models for Few-Shot... (2025)
Cross-Modal Direct Preference Optimization	Optimize preferences across seven combinations of visual, textual, and contextual dimensions to teach models correct cross-modal alignment.	MCM-DPO consistently outperforms standard DPO and SFT baselines on TAlt and PAlt benchmarks, establishing new state-of-the-art for alt-text generation.	Continual SFT Matches Multimodal RLHF... (2024), MCM-DPO (2025)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
VL-RewardBench	Accuracy	State-of-the-art among open-source models	Skywork-VL Reward (2025)
SVIP-Test	Accuracy	+6.3% over baseline for Qwen2-VL-7B	Benchmarking Multimodal CoT Reward Model... (2025)
VQAv2 (Zero-Shot)	Accuracy	+7.0% relative accuracy over baselines	Any-Modality (2023)
PreferenceHack	Robustness Accuracy	Surpasses GPT-4o	Activation Reward Models for Few-Shot... (2025)

⚠️ Known Limitations (4)

Step-level reward annotation requires executable visual programs, limiting applicability to tasks where code-based verification is feasible and excluding abstract reasoning or emotional understanding (affects: Stepwise Visual Program Reward Modeling)
Potential fix: Developing program-free step verification methods or combining code-based and neural-based verification approaches
Multimodal reward models trained on specific VLM outputs may not generalize to new model families or novel task distributions, creating a reward model specialization gap (affects: Scalable Multimodal Reward Models, Stepwise Visual Program Reward Modeling)
Potential fix: Dual-source data curation mixing standard and advanced reasoner outputs, and few-shot activation steering for rapid adaptation to new domains
Preference optimization across multiple modalities increases training complexity and risks catastrophic forgetting of text-only capabilities when fine-tuning on multimodal data (affects: Cross-Modal Direct Preference Optimization, Scalable Multimodal Reward Models)
Potential fix: Two-stage training that first aligns multimodal data then incorporates text-only data to prevent forgetting, as demonstrated by Skywork-VL Reward
Most alignment methods are evaluated on English-centric benchmarks with limited assessment of cross-cultural and multilingual multimodal alignment quality (affects: Visual Prefix Tuning for Frozen LLMs, Unified Any-to-Any Multimodal Alignment, Scalable Multimodal Reward Models)
Potential fix: Creating multilingual multimodal preference datasets and culturally diverse evaluation benchmarks

📚 View major papers in this topic (8)

Multimodal Few-Shot Learning with Frozen Language Models (2021-12) 9
NExT-GPT: Any-to-Any Multimodal LLM (2023-09) 8
Any-Modality Augmented Language Model (AnyMAL) (2023-09) 8
Continual SFT Matches Multimodal RLHF with Negative Supervision (2024-11) 7
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program (2025-04) 8
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning (2025-05) 8
Activation Reward Models for Few-Shot Model Alignment (2025-07) 8
MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation (2025-10) 7

💡 Moving to the next paradigm, we turn to Architecture and Efficiency.

🤖

Architecture and Efficiency

What: Research on designing efficient multimodal model architectures, parameter-efficient adaptation methods, and reinforcement learning techniques that enable scalable deployment of vision-language-action systems.

Why: Deploying multimodal models in real-world settings demands methods that balance capability with computational cost, adaptability, and safety across diverse modalities.

Baseline: Full fine-tuning of large pre-trained models on downstream multimodal tasks, updating all parameters with supervised learning on task-specific labeled data.

Full fine-tuning is computationally prohibitive and causes catastrophic forgetting of pre-trained knowledge across modalities
Fusing heterogeneous modalities (vision, language, audio, depth) while handling missing or noisy inputs remains brittle
Reinforcement learning for multimodal models suffers from training instability, sparse rewards, and poor transfer from language-centric paradigms to visual perception

🧪 Running Example

❓ Deploy a multimodal assistant on a mobile device that can interpret a user's photo of a cluttered desk, listen to their spoken question 'Where did I leave my keys?', and highlight the keys in the image.

Baseline: A fully fine-tuned multimodal model would require updating billions of parameters for this task, and the resulting model would need too much memory and respond too slowly on a mobile device. It would also process audio and vision independently, without cross-modal reasoning, and would likely fail to ground the spoken reference to the correct visual region.

Challenge: This example illustrates three key challenges: (1) the model must be compact enough for on-device deployment yet capable across modalities, (2) it must fuse speech and vision to ground a spoken reference in a cluttered scene, and (3) it must segment the target object precisely — a task where supervised fine-tuning alone struggles without dense pixel-level labels.

✅ Parameter-Efficient Visual Adaptation: Instead of fine-tuning all parameters, lightweight adapters (like LLaMA-Adapter's zero-initialized attention or Mona's depth-wise convolutions) are inserted into a frozen backbone, enabling the model to learn desk-object recognition while training only a small fraction of the parameters (under 1% for some methods, about 5% for Mona) and using minimal memory.

✅ RL-Enhanced Multimodal Training: Seg-R1's approach trains the model via reinforcement learning to generate point prompts that guide a frozen SAM2 segmenter, learning to highlight the keys without requiring dense pixel-level supervision — just a reward based on final mask quality.

✅ Compact Multimodal Architectures: Phi-4-Multimodal's Mixture-of-LoRA approach integrates vision, speech, and text into a single compact model by attaching modality-specific LoRA adapters to a frozen 3.8B-parameter language backbone, enabling real-time on-device processing of the photo and spoken query simultaneously.

✅ Foundation Model Multi-Modal Adaptation: MM-SAM's unsupervised cross-modal transfer adapts the Segment Anything Model to process depth data alongside RGB with only 0.05% additional parameters, enabling robust object segmentation even in cluttered, occluded scenes.
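
The adapter-on-frozen-backbone pattern used by several of the solutions above can be sketched with a low-rank (LoRA-style) update. This is a toy under stated assumptions: the identity backbone weight, the adapter values, and the dictionary keyed by modality are hypothetical and do not reproduce Phi-4-Multimodal's actual Mixture-of-LoRA implementation.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lora_forward(x, frozen_w, lora_a, lora_b, scale=1.0):
    """Compute W x + scale * B (A x); only A and B would receive gradients."""
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]

d = 4
frozen_w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # stand-in backbone

# One rank-1 adapter per modality: A is 1 x d, B is d x 1.
# The speech adapter still has its zero-initialized B, so it leaves the backbone output unchanged.
adapters = {
    "vision": ([[1.0, 0.0, 0.0, 0.0]], [[0.5], [0.0], [0.0], [0.0]]),
    "speech": ([[0.0, 1.0, 0.0, 0.0]], [[0.0], [0.0], [0.0], [0.0]]),
}

x = [1.0, 2.0, 3.0, 4.0]
y_vision = lora_forward(x, frozen_w, *adapters["vision"])
y_speech = lora_forward(x, frozen_w, *adapters["speech"])

# Adapter cost is 2 * rank * d versus d * d for the full matrix.
ratio_at_scale = (2 * 8 * 4096) / (4096 * 4096)  # rank 8 on a 4096-wide layer
```

At realistic layer widths the adapter is a fraction of a percent of the matrix it modifies, which is where the small trainable-parameter budgets come from.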

📈 Overall Progress

The field has undergone two major paradigm shifts: first, from full fine-tuning to parameter-efficient adaptation (2023), showing that freezing the backbone and tuning <1% of parameters can match or exceed full fine-tuning; second, from supervised fine-tuning to RL-based post-training (2025-2026), where GRPO variants are being systematically adapted for multimodal perception tasks. Simultaneously, compact architectures have matured from single-modality adapters to unified multi-modal models serving vision, language, and speech within a single frozen backbone.

📂 Sub-topics

Parameter-Efficient Fine-Tuning & Adaptation

25 papers

Methods for adapting large pre-trained models to downstream multimodal tasks using minimal trainable parameters, including adapters, prompt tuning, LoRA variants, and spectral-domain techniques that preserve pre-trained knowledge while enabling efficient specialization.

LLaMA-Adapter, Mona, Self-Prompt Tuning, PointGST

Reinforcement Learning for Multimodal Models

30 papers

Techniques applying reinforcement learning — especially Group Relative Policy Optimization (GRPO) and its variants — to enhance multimodal model capabilities in reasoning, perception, segmentation, and GUI interaction, addressing instability, sparse rewards, and visual-domain transfer challenges.

Seg-R1, Dr. Seg, R1-Reward, MAPLE
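
The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a minimal illustration of the normalization step only, assuming one scalar reward per sampled response; implementations differ in details such as the standard-deviation estimator and the `eps` value, which are assumptions here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for the same prompt, two of them correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Sparse-reward case: every sample fails, so there is no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case shows why sparse rewards are a problem for this family of methods: when every response in a group receives the same reward, all advantages are zero.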

Multi-Modal Fusion & Segmentation

30 papers

Approaches for integrating heterogeneous sensor modalities (RGB, depth, thermal, LiDAR, event cameras) into unified representations for segmentation, tracking, and scene understanding, often adapting foundation models like SAM to multi-modal inputs.

MM-SAM, MoE-LoRA SAM, Grouping Prompt Tuning, Sigma

Efficient & Compact Architectures

20 papers

Design of compact multimodal models, hardware-aware acceleration, and compression techniques that enable deployment on resource-constrained devices while maintaining strong performance across vision, language, and speech tasks.

Phi-4-Multimodal, Ovis2.5, VLA-Adapter, CHARM

Multimodal Agent Architectures & Benchmarks

20 papers

Frameworks and evaluation benchmarks for autonomous multimodal agents that operate in real-world environments — including GUI automation, tool use, robotic control, and proactive assistance — measuring planning, grounding, and safety capabilities.

WindowsAgentArena, GTA Benchmark, PIRA Framework, MobileGUI-RL

💡 Key Insights

💡 Frozen backbones with <1% tunable parameters frequently surpass full fine-tuning on dense visual tasks

💡 RL post-training paradigms designed for language reasoning fail on visual perception without domain-specific adaptations

💡 Modality-aware reward normalization reduces gradient variance by 10-13% and accelerates convergence 3x

💡 Current multimodal agents achieve less than 50% success rate on realistic tool-use and GUI benchmarks

💡 Compact 3-4B models with modality-specific LoRA can match performance of models twice their size


📅 Timeline

Research has evolved from isolated efficiency techniques (adapters, pruning) toward holistic multimodal system design — combining parameter-efficient adaptation, RL-based training, modality-aware optimization, and agentic capabilities into unified frameworks that are both compact and capable.

2023-01 to 2023-12 Foundation adaptation and parameter-efficient transfer learning

CHARM (2023) demonstrated 32.51x throughput gains for ViT inference by composing heterogeneous accelerators on a single chip
LLaMA-Adapter (2023) introduced zero-initialized attention gating for efficient instruction tuning with only 1.2M parameters
ViPT (Visual Prompt Multi-Modal Tracking, 2023) pioneered prompt-based multi-modal tracking, beating full fine-tuning with <1% trainable parameters
PerSAM (Personalize Segment Anything Model with One-Shot, 2023) achieved training-free SAM personalization, with a fine-tuned variant (PerSAM-F) that learns just 2 parameters
SkySense (2023) set a new standard as a billion-scale multimodal remote sensing foundation model across 16 datasets

🔀 Shift from full fine-tuning to frozen-backbone adaptation, proving that <1% of parameters can match or exceed full fine-tuning performance.

2024-01 to 2024-12 Multi-modal fusion, foundation model adaptation, and agent benchmarks

Mona (5%>>>100%: Breaking Performance Shackles, 2024) broke the full fine-tuning ceiling on dense prediction tasks with multi-cognitive visual adapters
MM-SAM (Segment Anything with Multiple Modalities, 2024) extended SAM to depth, thermal, and LiDAR with +17.5% IoU improvement and only 0.05% additional parameters
MoE-LoRA (Customize SAM with Mixture of..., 2024) introduced dynamic expert routing for robust multi-modal segmentation, gaining +28.14% mIoU on MUSES
Sigma (Siamese Mamba Network, 2024) replaced quadratic-complexity transformers with linear-complexity Mamba for efficient multi-modal segmentation
WindowsAgentArena (2024) and GTA (2024) established realistic agent benchmarks that reveal large gaps relative to human performance
MMAU (2024) showed that the best audio model (59.08%) trails human experts (81.85%) by a wide margin on reasoning-intensive audio tasks

2025-01 to 2026-03 RL-enhanced multimodal training, compact architectures, and proactive agents

R1-Reward (2025) introduced StableReinforce for stable RL-based reward modeling, achieving +13.5% on VL-RewardBench
Seg-R1 (Seg-R1, 2025) demonstrated that RL alone can train LMMs to perform segmentation via prompt generation without pixel-level labels
Phi-4-Multimodal (Phi-4, 2025) unified vision, speech, and text in a 3.8B model via Mixture-of-LoRA, ranking #1 on OpenASR
Dr. Seg (2026) revealed that perception tasks need breadth-first exploration rather than the depth-first convergence used in reasoning
MAPLE (2026) introduced modality-aware policy optimization, reducing gradient variance by 12.89% and converging 3.18x faster
Fine-R1 (Fine-R1, 2026) advanced fine-grained visual recognition by +23.75% through triplet-augmented RL with chain-of-thought reasoning
PIRA-Bench (2026) shifted the agent paradigm from reactive execution to proactive intent recommendation

🔀 Reinforcement learning became a primary post-training paradigm for multimodal models, with GRPO variants tailored for visual perception rather than just language reasoning.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Parameter-Efficient Visual Adaptation	Insert small, learnable modules (adapters, prompts, or spectral transforms) into frozen backbones to bridge domain gaps with minimal parameter overhead.	Mona surpasses full fine-tuning on COCO by +1.0% mAP and Pascal VOC by +3.6% AP using only 5% of trainable parameters. PointGST achieves 99.48% on ScanObjectNN with 0.67% trainable parameters, outperforming full fine-tuning by +1.6%.	LLaMA-Adapter (2023), 5%>>>100%: Breaking Performance Shackles of... (2024), Revisiting the Power of Prompt... (2024), Parameter-Efficient (2024), Adaptive Capacity Allocation for Vision... (2026)
RL-Enhanced Multimodal Training	Replace or augment supervised fine-tuning with policy optimization that uses verifiable rewards (mask IoU, correctness) to train multimodal models for perception and reasoning tasks.	R1-Reward achieves +13.5% on VL-RewardBench over state-of-the-art via StableReinforce. Dr. Seg gains +2.0 gIoU on ReasonSeg and +2.4 AP on COCO detection over standard GRPO. MAPLE converges 3.18x faster than modality-blind training.	R1-Reward (2025), Seg-R1 (2025), Dr. Seg (2026), MAPLE (2026), Fine-R1 (2026)
Foundation Model Multi-Modal Adaptation	Freeze the foundation model's core weights and inject modality-specific adapters or prompt mechanisms to bridge the gap between RGB-trained representations and new sensor modalities.	MM-SAM improves over vanilla SAM by +17.5% IoU on RGB-Thermal and +28.3% IoU on LiDAR with only 0.05% additional parameters. MoE-LoRA gains +28.14% mIoU on MUSES 3-modality segmentation over state-of-the-art. ViPT beats full fine-tuning by +10.5% success rate on LasHeR with <1% trainable parameters.	Visual Prompt Multi-Modal Tracking (2023), Segment Anything with Multiple Modalities (2024), Customize Segment Anything Model for... (2024), Prompting Multi-Modal Image Segmentation with... (2024), Personalize Segment Anything Model with One-Shot (2023)
Compact Multimodal Architectures	Attach modality-specific lightweight adapters to a frozen compact language backbone and use dynamic processing strategies to handle diverse inputs without scaling model size.	Phi-4-Multimodal matches models twice its size on math and coding while ranking first on OpenASR with only 460M speech LoRA parameters. Ovis2.5-9B achieves 78.3 on OpenCompass, setting SOTA for open-source MLLMs under 40B. VLA-Adapter trains a full VLA in 8 hours on a single consumer GPU.	Phi-4-Mini (2025), Ovis2.5 Technical Report (2025), VLA-Adapter (2025), CHARM (2023)
Multimodal Agent Frameworks	Equip multimodal models with environment interaction capabilities (clicking, typing, API calls) and evaluate on realistic, implicit-planning tasks rather than simplified text-only benchmarks.	InfiGUI-G1 achieves up to 9.0% relative improvement over naive RLVR baselines on GUI grounding via Adaptive Exploration Policy Optimization. MobileGUI-RL generates curriculum tasks via random walks and GPT-4o reverse-engineering for scalable online training.	WindowsAgentArena (2024), GTA (2024), PIRA-Bench (2026), InfiGUI-G1 (2025)
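
The verifiable reward named in the RL-Enhanced Multimodal Training row (mask IoU) can be computed directly from a predicted and a reference mask. A minimal sketch, assuming binary masks stored as lists of 0/1 rows; the example masks are made up, and the convention of returning 1.0 for two empty masks is an assumption.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks given as lists of 0/1 rows."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return inter / union if union else 1.0  # two empty masks count as a perfect match

predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
reference_mask = [[1, 0, 0],
                  [0, 1, 1]]

# The scalar can be used as the reward for the response that produced the mask.
reward = mask_iou(predicted_mask, reference_mask)
```

Because the reward is computed from the final mask alone, no per-step or per-pixel supervision of the model's intermediate outputs is needed.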

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
OpenCompass Multimodal Leaderboard	Average Score	78.3 (SOTA for open-source <40B)	Ovis2.5 Technical Report (2025)
ScanObjectNN (OBJ_BG)	Accuracy	99.48%	Parameter-Efficient (2024)
VL-RewardBench	Accuracy	+13.5% over SOTA (with inference-time scaling)	R1-Reward (2025)
MUSES (3-modality semantic segmentation)	mIoU	+28.14% mIoU over SOTA	Customize Segment Anything Model for... (2024)
MMAU (Massive Multi-Task Audio Understanding)	Accuracy	59.08% (cascaded approach) vs. 81.85% human baseline	MMAU (2024)

⚠️ Known Limitations (4)

Parameter-efficient methods are validated primarily on classification and segmentation but struggle with open-ended generation tasks where the full model capacity is needed for creative and diverse outputs. (affects: Parameter-Efficient Visual Adaptation, Foundation Model Multi-Modal Adaptation)
Potential fix: Model Tailor's sparse patching with Hessian-based decoration selects the minimal parameter subset to update, reducing forgetting while maintaining target task performance.
RL-based multimodal training suffers from reward sparsity and training instability, especially for long-horizon tasks like GUI automation where the outcome is only observable after many steps. (affects: RL-Enhanced Multimodal Training, Multimodal Agent Frameworks)
Potential fix: DeepVideo-R1 reformulates RL as a regression task with difficulty-aware augmentation; SAPO replaces hard clipping with smooth sigmoid gates and asymmetric temperatures to stabilize training.
Multi-modal fusion methods assume all modalities are available at inference time, but real-world deployments frequently face missing, corrupted, or asynchronous sensor inputs that degrade performance. (affects: Foundation Model Multi-Modal Adaptation, Compact Multimodal Architectures)
Potential fix: DrFuse decomposes representations into shared and distinct components so the shared part can be inferred from any available modality. MoE-LoRA's dynamic routing gracefully degrades by assigning zero weight to missing modality experts.
Safety alignment degrades significantly when reasoning capabilities are added to multimodal models, with multimodal large reasoning models (MLRMs) exhibiting 37% higher jailbreaking success rates than base models. (affects: RL-Enhanced Multimodal Training, Compact Multimodal Architectures)
Potential fix: Safe RLHF-V decouples helpfulness and safety into separate reward streams with Lagrangian-constrained optimization. SafeMLRM identifies 'emergent self-correction' where 16.23% of unsafe reasoning chains are overridden by safe final answers.
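
The missing-modality fix described above, where routing assigns zero weight to experts whose modality is absent, can be sketched as a masked softmax. This is an illustrative assumption about how such a router could behave, not MoE-LoRA's actual implementation; the `route` function, modality names, and logits are hypothetical.

```python
import math

def route(gate_logits, available):
    """Softmax over experts, forcing zero weight for experts whose modality is missing."""
    present = {m: logit for m, logit in gate_logits.items() if available.get(m, False)}
    if not present:
        raise ValueError("at least one modality must be available")
    top = max(present.values())
    exps = {m: math.exp(logit - top) for m, logit in present.items()}
    total = sum(exps.values())
    return {m: (exps[m] / total if m in exps else 0.0) for m in gate_logits}

gate_logits = {"rgb": 1.0, "depth": 1.0, "thermal": 3.0}

# The thermal sensor has dropped out, so its expert is excluded from the mixture.
weights = route(gate_logits, {"rgb": True, "depth": True, "thermal": False})
```

The remaining weights are renormalized over the available experts, so the fused output degrades gradually instead of mixing in features from a missing input.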

📚 View major papers in this topic (10)

Phi-4-Mini and Phi-4-Multimodal (2025-04) 9
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery (2023-12) 9
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024-10) 9
LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention (2023-03) 8
5%>>>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks (2024-08) 8
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning (2025-05) 8
Segment Anything with Multiple Modalities (2024-08) 8
MAPLE: Modality-Aware Post-training and Learning Ecosystem (2026-02) 8
WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale (2024-09) 8
Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning (2025-06) 8

💡 Diving deeper into Architecture and Efficiency, let's examine specific research threads that define this area.

📐

Visual Encoders and Projections

What: Research on designing, compressing, and adapting Vision Transformer architectures and their projection layers for efficient multi-modal understanding and deployment on resource-constrained devices.

Why: Vision Transformers are powerful but computationally expensive, limiting deployment on edge devices and real-time multi-modal applications that require fast visual reasoning.

Baseline: Standard full-precision Vision Transformers with multi-head self-attention, extracting final-layer features projected through dense linear adapters into language model space.

Non-normal activation distributions from Softmax and GELU make standard quantization techniques fail on Vision Transformers
Multi-head attention introduces memory access bottlenecks that limit real-world inference speed despite low FLOP counts
Dense parameter-shared projection layers create gradient conflicts when aligning heterogeneous modalities like vision, speech, and text

🧪 Running Example

❓ Deploy a visual question answering model on a mobile phone to read a restaurant menu photographed in dim lighting and extract dish names with prices.

Baseline: A standard ViT-B backbone requires ~17.6 GFLOPs and 86M parameters at full precision, causing 2+ second latency on mobile. The final-layer features miss fine-grained text details, and a dense projection adapter struggles to align visual features with the language model.

Challenge: The menu has tiny text requiring fine-grained spatial detail (lost in aggressive quantization), dim lighting creates outlier activations in LayerNorm (breaking standard quantizers), and the model must project visual features into language space efficiently (single dense adapter bottleneck).

✅ Distribution-Aware Post-Training ViT Quantization: ADFQ-ViT separates outlier activations and uses per-patch quantization, preserving fine text details while compressing the model to 4-bit with only 0.67% accuracy loss.

✅ Memory-Efficient ViT Architecture: SHViT's single-head attention on partial channels and large-stride patchify stem reduce memory access costs and run 2.4x faster on an iPhone 12, enabling real-time inference on the mobile device.

✅ Sparse Expert Multi-modal Projection: MoE-Adapter routes visual features through specialized experts, avoiding gradient conflicts between text-recognition and scene-understanding pathways during training.

✅ Structured Visual Encoder Pretraining: LLM-supervised structured pretraining (VIVID-Med) produces visual encoders that capture richer semantic relationships, improving downstream text extraction without needing massive pretraining data.
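
Why outlier activations break a standard quantizer can be seen in a toy comparison. The sketch below is a generic illustration, not ADFQ-ViT's algorithm: the activation values, the threshold, and the helper names are assumptions, and a real scheme would also need to store which group each value belongs to.

```python
def quantize_uniform(values, bits):
    """Uniform quantize-dequantize over the observed [min, max] range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0.0:
        return list(values)  # a constant group is represented exactly
    return [lo + round((v - lo) / scale) * scale for v in values]

def quantize_split(values, bits, threshold):
    """Quantize inliers and outliers with separate ranges, preserving order."""
    inliers = [v for v in values if abs(v) <= threshold]
    outliers = [v for v in values if abs(v) > threshold]
    q_in = iter(quantize_uniform(inliers, bits) if inliers else [])
    q_out = iter(quantize_uniform(outliers, bits) if outliers else [])
    return [next(q_in) if abs(v) <= threshold else next(q_out) for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy activations: many small values plus one large outlier.
activations = [0.01 * i for i in range(-50, 51)] + [40.0]

err_uniform = mse(activations, quantize_uniform(activations, bits=4))
err_split = mse(activations, quantize_split(activations, bits=4, threshold=1.0))
```

With a single shared range, the outlier stretches the 4-bit grid so far that all the small values collapse onto one level. Giving the outlier its own range restores resolution for the rest.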

📈 Overall Progress

The field has evolved from adapting CNN compression techniques to ViTs to developing ViT-native solutions that exploit architectural properties like power-law activations and channel-wise outliers. A major paradigm shift occurred around 2024 when researchers recognized that ViT activations require fundamentally different quantization approaches than CNNs. More recently, the focus has broadened beyond pure efficiency to include structured pretraining with LLM supervision, sparse expert-based multi-modal projection, and mechanistic understanding of what visual encoders learn internally — reflecting a maturation from compressing encoders to comprehensively designing them.

📂 Sub-topics

Post-Training Quantization for Vision Transformers

16 papers

Methods for compressing pre-trained ViTs to low bit-widths (3-8 bit) without retraining, addressing unique challenges from non-normal activation distributions in Softmax, GELU, and LayerNorm operations. The largest research cluster in this topic.

TSPTQ-ViT, I&S-ViT, DopQ-ViT, ADFQ-ViT

Efficient Vision Transformer Architectures

3 papers

Macro and micro architectural innovations that reduce memory access costs and computational redundancy in ViTs for real-time deployment on edge devices, including single-head designs and sandwich layouts.

SHViT, EfficientViT, Cascaded Group Attention

Multi-modal Feature Projection and Alignment

8 papers

Adapter and projection architectures that bridge visual encoders with language models, including sparse expert routing, progressive context extension, layer-wise feature fusion, and modality-specific alignment strategies.

MoE-Adapter, Long-VITA, VITA-1.5, Layer-wise Feature Fusion

Visual Encoder Pretraining and Fine-tuning

12 papers

Novel strategies for pretraining visual encoders with structured LLM supervision, reinforcement learning rewards, multi-modal contrastive objectives, and efficient adaptation techniques that improve robustness and transferability.

VIVID-Med, GRPO-RM, VITaL Pretraining, Concept-Guided Fine-Tuning

Visual Encoder Interpretability and Robustness

5 papers

Research on understanding internal mechanisms of vision transformers, locating demographic biases at the attention-head level, analyzing action-outcome circuits, and improving robustness to distribution shifts through mechanistic analysis.

Action-Outcome Circuit Analysis, Bias-Augmented Residual Analysis, HELM

💡 Key Insights

💡 ViT activations require fundamentally different quantization than CNNs due to non-normal distributions

💡 Memory access cost, not FLOPs, is the true bottleneck for on-device ViT inference speed

💡 Middle ViT layers often outperform final layers by 20% for spatial and fine-grained tasks

💡 Frozen LLMs can supervise visual encoder pretraining more effectively than free-form text with far less data

💡 Sparse expert routing resolves gradient conflicts that plague dense multi-modal projection adapters


📅 Timeline

Research progressed from basic ViT quantization and efficient architectures (2023) through specialized distribution-aware methods and hardware co-design (2024) to structured LLM-supervised pretraining, multi-modal projection, and mechanistic interpretability (2025-2026), reflecting a shift from pure compression to holistic visual encoder design.

2023-05 to 2023-12 Foundational ViT quantization and efficient architecture design

TSPTQ-ViT (2023) introduced two-scaled quantization separating activation magnitudes for Softmax and GELU, achieving <0.5% accuracy drop at 8-bit
EfficientViT (2023) established memory-efficient ViT design with the sandwich layout and cascaded group attention, running 5.8x faster than MobileViT
I&S-ViT (I&S-ViT, 2023) achieved a breakthrough +50.68% accuracy recovery at 3-bit through the Shift-Uniform-Log2 Quantizer (SULQ) and smooth optimization strategy
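
Softmax outputs are concentrated near zero, which is why logarithmic quantizers suit them better than uniform ones. The sketch below shows a plain log2 quantizer for illustration only; it is not the Shift-Uniform-Log2 Quantizer from I&S-ViT, and the function names and bit-width are assumptions.

```python
import math

def log2_quantize(p, bits):
    """Map p in (0, 1] to an integer code q so that p is approximated by 2**-q."""
    q = round(-math.log2(p))
    return min(max(q, 0), 2 ** bits - 1)

def log2_dequantize(q):
    """Recover the approximate probability from its integer code."""
    return 2.0 ** -q

def uniform_quantize(p, bits):
    """Uniform quantize-dequantize on [0, 1] for comparison."""
    levels = 2 ** bits - 1
    return round(p * levels) / levels

small_attention_weight = 0.01
log_value = log2_dequantize(log2_quantize(small_attention_weight, bits=4))
uniform_value = uniform_quantize(small_attention_weight, bits=4)
```

At 4 bits the uniform grid rounds a small attention weight to zero, while the logarithmic grid keeps it at the nearest power of two.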

2024-01 to 2024-08 Rapid proliferation of specialized ViT quantizers and hardware co-design

SHViT (2024) demonstrated that single-head attention on partial channels matches multi-head performance while being 2.4x faster on iPhone 12
P2-ViT (2024) pioneered power-of-two quantization with a dedicated hardware accelerator, achieving 10.1x speedup over GPU Tensor Cores
ADFQ-ViT (2024) and DopQ-ViT (2024) independently tackled outlier-aware and distribution-friendly quantization for ViT activations
PTQ4SAM (2024) extended ViT quantization to foundation models, addressing bimodal distributions unique to the Segment Anything Model with 3.9x FLOPs reduction
COMQ (2024) eliminated backpropagation from PTQ entirely via coordinate descent, achieving <1% accuracy loss at 4-bit

🔀 Research shifted from adapting CNN quantization techniques to designing ViT-native quantizers that explicitly model power-law and bimodal activation distributions.
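
Power-of-two scaling factors matter for hardware because rescaling becomes a bit shift instead of a floating-point multiply. A minimal sketch of that idea, assuming integer accumulators and a shift of at least 1; the function names and the example scale are hypothetical and this is not P2-ViT's actual pipeline.

```python
import math

def nearest_pot_shift(scale):
    """Return n such that 2**-n is the power of two closest to `scale` in log space."""
    return round(-math.log2(scale))

def requantize_shift(acc, shift):
    """Rescale an integer accumulator by 2**-shift, rounding to nearest, with integer ops only."""
    return (acc + (1 << (shift - 1))) >> shift

# A floating-point scale of 0.06 is approximated by 2**-4 = 0.0625.
shift = nearest_pot_shift(0.06)

# Integer-only rescaling of an accumulator value.
rescaled = requantize_shift(1000, shift)
```

The result stays within half a unit of the exact product with the power-of-two scale, while the datapath needs only an add and a shift.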

2024-09 to 2025-06 Multi-modal integration, progressive context scaling, and architecture-informed quantization

VITA-1.5 (VITA-1.5, 2025) demonstrated three-stage progressive training to integrate vision, audio, and speech without modality interference, approaching GPT-4o capabilities
(Long-VITA, 2025) scaled visual-language models to 1 million tokens through phased context-length training with logits-masked inference
(AIQViT, 2025) introduced learnable low-rank adapters alongside quantized weights for architecture-informed compensation across 5 vision tasks
(APHQ-ViT, 2025) replaced GELU with ReLU via knowledge distillation and introduced perturbation-based Hessian estimation, outperforming prior methods by up to 30% at 3-bit

2025-07 to 2026-03 Structured pretraining with LLM supervision, interpretability, and concept-guided adaptation

(Shallower Layers, 2025) reported that middle ViT layers outperform the conventionally used final layers by 20% on spatial tasks, challenging a widespread design assumption
(VIVID-Med, 2026) used a frozen LLM as a structured semantic teacher to pretrain deployable medical ViTs, outperforming BiomedCLIP with 500x less data
(MoE-Adapter, 2026) resolved gradient conflicts in multi-modal projection through sparse expert routing with load-balancing
(Bias in CLIP, 2026) located demographic bias at individual attention heads in CLIP, enabling targeted debiasing by ablating just 4 heads

🔀 Focus expanded from pure compression and efficiency to understanding and steering what visual encoders learn, using LLMs as semantic teachers and mechanistic interpretability to locate and remove biases.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Distribution-Aware Post-Training ViT Quantization	Tailored quantization schemes that split activations by magnitude, use adaptive logarithmic bases, or separate outlier channels to preserve precision where standard uniform quantizers fail.	I&S-ViT improves on RepQ-ViT by +50.68% Top-1 accuracy on 3-bit ViT-B; ADFQ-ViT improves on RepQ-ViT by +10.23% Top-1 on 4-bit ViT-B; APHQ-ViT outperforms PTQ4ViT by +3.65% Top-1 on ViT-B at 4-bit, achieving 78.43%	APHQ-ViT (2025), I&S-ViT: An Inclusive & Stable... (2023), ADFQ-ViT (2024), DopQ-ViT (2024), AdaLog (2024)
Hardware-Accelerated ViT Inference	Power-of-Two (PoT) scaling factors enable pure bit-shift re-quantization, eliminating costly floating-point operations and enabling dedicated ViT accelerator chips with pipelined dataflows.	P2-ViT achieves 10.1x speedup and 36.8x energy savings over GPU Turing Tensor Cores while maintaining 81.39% Top-1 on ImageNet for ViT-B; Trio-ViT delivers 7.3x FPS improvement over ViTCoD accelerator	P2-ViT (2024), Trio-ViT (2024), AIQViT (2025)
Memory-Efficient ViT Architecture	Single-head attention on partial channels combined with large-stride patchification and sandwich FFN layouts eliminate redundant memory operations without sacrificing accuracy.	SHViT-S4 outperforms MobileViTv2-1.0 by +1.3% accuracy while being 2.4x faster on iPhone 12; EfficientViT-M5 surpasses MobileNetV3-Large by +1.9% while running 40.4% faster on V100 GPU	SHViT (2024), EfficientViT (2023)
Sparse Expert Multi-modal Projection	A learnable router directs different modality segments to dedicated experts, isolating conflicting optimization gradients while progressive training prevents cross-modal interference.	MoE-Adapter improves on dense baselines by +3.75% accuracy on OpenBookQA (50.10% to 53.85%) and reduces the audio-text modality gap from -17.83 to -14.67 on MMSU	MoE-Adapter (2026), VITA-1.5 (2025), Long-VITA (2025), Multimodal Language Models See Better... (2025)
Structured Visual Encoder Pretraining	Using LLM-generated structured schemas or RL-based reward functions as supervision produces visual encoders with stronger generalization and fewer spurious correlations than conventional one-hot or free-text training.	VIVID-Med outperforms BiomedCLIP by +6.65 points macro-AUC on CheXpert linear probing (achieving 0.8588) despite using 500x less pretraining data; GRPO-RM achieves +4.26% average accuracy on out-of-distribution datasets over standard fine-tuning	VIVID-Med (2026), GRPO-RM (2025), Pretrained Visual Uncertainties (2024), Concept-Guided Fine-Tuning (2026)
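
The routing idea in the Sparse Expert Multi-modal Projection row can be sketched as a top-1 router plus an auxiliary balancing term. This is a hedged illustration, not MoE-Adapter's implementation: the summary does not say which load-balancing loss the paper uses, so the Switch-Transformer-style term `N * sum_i f_i * P_i` is an assumption, and the token features and router weights are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(tokens, router_weights):
    """Return (chosen expert index per token, router probabilities per token)."""
    assignments, all_probs = [], []
    for tok in tokens:
        logits = [sum(w * x for w, x in zip(row, tok)) for row in router_weights]
        probs = softmax(logits)
        assignments.append(max(range(len(probs)), key=probs.__getitem__))
        all_probs.append(probs)
    return assignments, all_probs

def load_balance_loss(assignments, all_probs, num_experts):
    """Assumed auxiliary loss: N * sum_i f_i * P_i (equals 1.0 when balanced)."""
    n = len(assignments)
    frac = [assignments.count(e) / n for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in all_probs) / n for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Hypothetical 2-D features: two "visual-like" tokens, two "audio-like" tokens.
tokens = [[2.0, 0.0], [1.5, 0.1], [0.0, 2.0], [0.1, 1.8]]
router_weights = [[1.0, -1.0], [-1.0, 1.0]]  # expert 0 favours dim 0, expert 1 dim 1

assignments, probs = route_top1(tokens, router_weights)
aux = load_balance_loss(assignments, probs, num_experts=2)
print(assignments, round(aux, 6))
```

Because each modality lands on its own expert, gradients from the two token groups update different expert parameters; the auxiliary term grows above 1.0 when the router collapses onto a single expert.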

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
ImageNet-1K Classification (4-bit W4/A4 ViT-B)	Top-1 Accuracy	81.3% Top-1 Accuracy	AIQViT (2025)
ImageNet-1K Classification (3-bit ViT-B)	Top-1 Accuracy Recovery	+50.68% accuracy recovery over prior state-of-the-art	I&S-ViT: An Inclusive & Stable... (2023)
ImageNet-1K Efficient ViT Inference Speed	Top-1 Accuracy at matched latency	SHViT-S4: 2.4x faster than MobileViTv2-1.0 at +1.3% higher accuracy on iPhone 12	SHViT (2024)
CheXpert Linear Probing	Macro-AUC	0.8588 macro-AUC	VIVID-Med (2026)
ViT Hardware Energy Efficiency	Speedup and energy savings vs GPU Tensor Cores	10.1x speedup, 36.8x energy savings over GPU Turing Tensor Cores	P2-ViT (2024)
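
The reason power-of-two scaling factors are hardware-friendly is that re-quantization reduces to an integer shift. The sketch below shows that arithmetic only; the function name, the accumulator values, and the shift amount are hypothetical, and it does not model P2-ViT's accelerator or dataflow.

```python
def requantize_shift(acc, shift):
    """Rescale an integer accumulator by 2**-shift using a rounding right shift.

    Adding half of the divisor before shifting rounds to nearest (ties go up).
    Python's >> on negative ints is an arithmetic (floor) shift.
    """
    if shift == 0:
        return acc
    return (acc + (1 << (shift - 1))) >> shift

accumulators = [1000, -1000, 37, 255, -4096]
shift = 4  # hypothetical power-of-two scale: 2**-4 = 0.0625

shifted = [requantize_shift(a, shift) for a in accumulators]
exact = [a * 2.0 ** -shift for a in accumulators]
print(shifted)
print(exact)
```

Every shifted result stays within half a unit of the exact floating-point product, without any floating-point multiply in the integer path.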

⚠️ Known Limitations (4)

Most ViT quantization methods are validated primarily on ImageNet classification; generalization to diverse downstream tasks such as detection, segmentation, and generation remains underexplored and may not transfer directly. (affects: Distribution-Aware Post-Training ViT Quantization, Hardware-Accelerated ViT Inference)
Potential fix: PTQ4SAM and ERQ have begun extending quantization to SAM and dense prediction models; more systematic cross-task evaluation frameworks are needed.
Ultra-low-bit quantization (3-bit) still incurs significant accuracy drops on complex hierarchical architectures like Swin Transformers, making practical deployment below 4-bit challenging for production systems. (affects: Distribution-Aware Post-Training ViT Quantization)
Potential fix: Combining distribution-aware quantizers with low-rank compensation (AIQViT) or GELU-to-ReLU substitution via knowledge distillation (APHQ-ViT) shows promise for pushing below 4-bit.
Efficient ViT architectures sacrifice fine-grained spatial information for speed — single-head designs and large-stride stems aggressively reduce tokens, which may hurt dense prediction tasks requiring per-pixel precision. (affects: Memory-Efficient ViT Architecture)
Potential fix: Hybrid approaches combining partial-channel attention with local depthwise convolutions (as in SHViT) partially address this, but dedicated efficient ViTs for dense prediction remain an open area.
Multi-modal projection methods are evaluated on different benchmarks with different LLM backbones and training data scales, making fair comparison across projection architectures extremely difficult. (affects: Sparse Expert Multi-modal Projection, Structured Visual Encoder Pretraining)
Potential fix: Standardized multi-modal benchmarks like Creation-MMBench are emerging to enable fairer comparisons; the community needs agreed-upon evaluation protocols for projection layers.

💡 Within the same paradigm, another important research direction focuses on Token Compression and Efficient Inference.

🎯 Token Compression and Efficient Inference

What: Research on compressing model representations—through quantization, token pruning, and distillation—to enable efficient inference for vision and multimodal models on resource-constrained devices.

Why: Deploying large vision transformers and multimodal LLMs in real-time applications demands dramatic reductions in memory, latency, and compute without sacrificing accuracy.

Baseline: Full-precision (FP32) models with all visual tokens processed through every layer, requiring maximum memory and compute resources.

Vision transformer activations exhibit extreme outliers and non-normal distributions that break standard quantizers
Visual token redundancy in multimodal LLMs wastes compute on background regions irrelevant to the task
Aggressive compression at ultra-low bit-widths causes catastrophic accuracy collapse in safety-critical applications

🧪 Running Example

❓ A robot must identify and grasp a small red cup on a cluttered table using a Vision-Language-Action model running on an edge GPU with 4GB memory.

Baseline: The full-precision VLA model processes all 576 visual tokens through all 32 LLM layers at FP32 precision, consuming 14GB of memory (far exceeding the 4GB budget) and running at only 2 FPS, too slow for real-time grasping.

Challenge: The scene has mostly irrelevant background tokens (table surface, walls), the model's post-Softmax activations follow a power-law distribution that breaks naive INT8 quantization, and the small cup requires preserving fine-grained detail tokens despite compression.

✅ Distribution-Aware Vision Transformer Quantization: RepQuant uses complex quantizers during calibration to handle the power-law Softmax distribution, then reparameterizes them into simple hardware-friendly forms, enabling accurate 4-bit inference while fitting within the 4GB memory budget.

✅ Progressive Visual Token Pruning: METEOR prunes 76% of visual tokens across encoding, fusion, and decoding stages, retaining only the tokens covering the red cup region—reducing FLOPs by 49% while preserving the fine detail needed for grasping.

✅ Dynamic Visual Resolution Routing: InternVL3.5's Visual Resolution Router assigns high resolution to the cup area and compresses the cluttered background, achieving roughly 4× speedup while retaining nearly all of the original accuracy on the task.
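
The numbers in this running example can be checked with a back-of-the-envelope calculation. The sketch assumes, purely for illustration, that the whole 14GB footprint is model weights and that memory scales linearly with bit-width; real deployments also pay for activations, caches, and quantization metadata, so treat the results as rough lower bounds. The helper names are hypothetical.

```python
FP32_BITS = 32

def weight_memory_gb(fp32_gb, bits):
    """Scale an FP32 weight footprint to a lower bit-width (ignores scales and zero-points)."""
    return fp32_gb * bits / FP32_BITS

def tokens_kept(total_tokens, prune_ratio):
    """Number of visual tokens that survive pruning at the given ratio."""
    return round(total_tokens * (1.0 - prune_ratio))

fp32_gb = 14.0    # footprint stated in the running example
budget_gb = 4.0   # edge-GPU budget stated in the running example

int8_gb = weight_memory_gb(fp32_gb, 8)
int4_gb = weight_memory_gb(fp32_gb, 4)
kept = tokens_kept(576, 0.76)

print(f"INT8: {int8_gb} GB, INT4: {int4_gb} GB, budget: {budget_gb} GB")
print(f"tokens kept after 76% pruning: {kept} of 576")
```

Under these assumptions 8-bit weights leave little headroom under the 4GB budget while 4-bit weights leave more than half of it free, which is why the example targets 4-bit inference.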

📈 Overall Progress

The field has progressed from recognizing that standard quantization fails on vision transformers (2023), through developing specialized solutions for diverse vision domains like SAM, LiDAR, and diffusion models (2024), to integrating quantization with token compression and dynamic routing in production multimodal systems (2025–2026). A key paradigm shift is the move from treating efficiency as purely a post-hoc compression problem to designing architectures with built-in elastic inference capabilities, as seen in Matryoshka representations and visual resolution routing.

📂 Sub-topics

Post-Training Quantization for Vision Models

30 papers

Methods that quantize pretrained vision transformers and related architectures to low bit-widths (3–8 bit) without retraining, addressing challenges like outlier activations, non-normal distributions, and hardware compatibility.

RepQuant APHQ-ViT I&S-ViT COMQ

Visual Token Reduction for Multimodal LLMs

8 papers

Techniques that reduce the number or dimensionality of visual tokens fed into large multimodal models, including token pruning, merging, early-layer bypass, and dynamic resolution routing.

METEOR FMVR DeepInsert VMTC

Knowledge Distillation for Efficient Multimodal Inference

14 papers

Methods that transfer knowledge from large teacher models to compact student models, enabling deployment of multimodal capabilities on edge devices through cross-modal, competitive, or chain-of-thought distillation.

CoMD ARMADA CoT-Drive UDRL-SLM

Discrete Tokenization and Efficient Inference Infrastructure

10 papers

Surveys and frameworks covering discrete tokenizer design, training infrastructure for efficient fine-tuning, and semantic compression for bandwidth-constrained multimodal systems.

SWIFT Discrete Tokenizer Taxonomy RAMSemCom Semantic ID Compression

💡 Key Insights

💡 Matching quantizer shape to activation distribution is critical for ViT accuracy at low bit-widths.

💡 Pruning 76% of visual tokens costs only about 0.3% accuracy while roughly halving compute in multimodal LLMs.

💡 Domain-specific PTQ calibration can nearly match full-precision performance (e.g., 60.12 vs 60.32 mAPH for INT8 LiDAR-PTQ on Waymo), which matters for safety-critical deployments.

💡 Dynamic resolution routing enables 4× inference speedup with negligible accuracy loss.

💡 Small distilled models come within 6% of large-model quality with 80× fewer parameters.

📅 Timeline

Research has evolved from layer-by-layer quantization fixes toward holistic efficiency frameworks that combine multiple compression strategies—quantization, token pruning, distillation, and dynamic resolution—into unified systems capable of adapting compute budgets at inference time.

2023-02 to 2023-12 Foundation of ViT-specific post-training quantization

(TSPTQ-ViT, 2023) introduced two-scaled quantization splitting activations by magnitude for Softmax and GELU outputs
MRECG (Solving Oscillation Problem in PTQ, 2023) theoretically proved oscillation in PTQ and proposed mixed reconstruction granularity, gaining +6.61% on MobileNetV2
I&S-ViT (I&S-ViT, 2023) achieved stable 3-bit ViT quantization with shift-uniform-log2 quantizer, elevating accuracy by 50.68% over RepQ-ViT
FP8 PTQ study (Efficient Post-training Quantization with FP8, 2023) demonstrated FP8 covers 92.64% of workloads versus only 65.87% for INT8 across 75 diverse models

🔀 Recognition that standard CNN quantization methods fail on Vision Transformers due to fundamentally different activation distributions from Softmax, GELU, and LayerNorm.

2024-01 to 2024-12 Explosion of domain-specific quantization and emergence of visual token compression

(LiDAR-PTQ, 2024) pioneered point-cloud-aware quantization with sparsity calibration, achieving near-lossless INT8 on Waymo with 3× speedup
(RepQuant, 2024) introduced quantization-inference decoupling via scale reparameterization, gaining +30.7% on ViT-S at W4/A4
PTQ4SAM (PTQ4SAM, 2024) solved SAM's bimodal distribution problem with bimodal integration and adaptive granularity quantization
VQ4DiT (VQ4DiT, 2024) enabled 2-bit diffusion transformers through simultaneous codebook and assignment calibration, achieving 3.32 FID
(Visual-Modality, 2024) showed that compressing redundant visual tokens improves MLLM instruction-following by +9.5%

🔀 Shift from generic ViT PTQ to domain-specific quantization (SAM, LiDAR, diffusion, super-resolution) and the first visual token compression methods for multimodal LLMs.

2025-01 to 2026-03 Unified efficiency frameworks combining quantization, token compression, and dynamic routing for production MLLMs

InternVL3.5 (InternVL3.5, 2025) introduced Visual Resolution Router achieving 4.05× speedup while scoring 77.7 on MMMU, narrowing the gap with GPT-5
(METEOR, 2025) demonstrated progressive multi-stage pruning reducing tokens by 76% with only 0.3% accuracy drop across 11 benchmarks
(APHQ-ViT, 2025) replaced Fisher approximation with direct perturbation Hessian for robust ultra-low-bit quantization
(FMVR, 2026) achieved 89% FLOPs reduction via frequency-modulated Matryoshka visual restoration, outperforming FastV by up to 7%
(Render-of-Thought, 2026) pioneered compressing chain-of-thought reasoning into visual tokens for 3–4× token compression with 4.6× speedup

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Distribution-Aware Vision Transformer Quantization	Match quantizer shape to activation distribution (power-law for Softmax, outlier-prone for LayerNorm) rather than forcing a uniform grid.	RepQuant improves on PTQ4ViT by +30.7% accuracy on ImageNet ViT-S at W4/A4, achieving 73.28% Top-1; APHQ-ViT outperforms PTQ4ViT by +3.65% on ViT-B at 4-bit, reaching 78.43%.	RepQuant (2024), APHQ-ViT (2025), I&S-ViT: An Inclusive & Stable... (2023), ADFQ-ViT (2024), DopQ-ViT (2024)
Task-Specific Post-Training Quantization	Design calibration and quantization strategies tailored to the unique activation patterns of each vision domain rather than applying generic ViT PTQ.	LiDAR-PTQ achieves 60.12 mAPH on Waymo CenterPoint-Pillar (INT8), matching FP32 baseline (60.32) and outperforming BRECQ by +3.87 mAPH; PTQ4SAM achieves lossless 6-bit SAM-L with 3.9× FLOPs reduction.	LiDAR-PTQ (2024), PTQ4SAM (2024), VQ4DiT (2024), 2DQuant: Low-bit Post-Training Quantization for... (2024), Post-Training (2025)
Progressive Visual Token Pruning	Identify and remove redundant visual tokens at encoding, fusion, and decoding stages using attention scores, information rank, or frequency decomposition.	METEOR reduces visual tokens by 76% over EAGLE baseline with only 0.3% accuracy drop and outperforms FastV by +4.1% average across 11 benchmarks; FMVR reduces FLOPs by 89% while maintaining ~100% of LLaVA-1.5-7B accuracy.	METEOR (2025), Frequency-Modulated (2026), DeepInsert (2025), Visual-Modality (2024)
Dynamic Visual Resolution Routing	Let the model learn to allocate visual compute adaptively based on content complexity rather than using a fixed resolution for all inputs.	InternVL3.5 achieves 4.05× speedup over InternVL3 with Visual Resolution Router while scoring 77.7 on MMMU, narrowing the gap with GPT-5 to 3.9%; Long-VITA extends context to 1M tokens with 2× prefill speedup.	InternVL3.5 (2025), Long-VITA (2025), Render-of-Thought (2026)
Cross-Modal Knowledge Distillation	Use competitive, chain-of-thought, or manifold-alignment distillation to transfer multimodal understanding from large teachers to lightweight students without requiring multimodal inputs at inference.	UDRL-SLM achieves relevance within 6% of Llama-3 8B with 80× fewer parameters (100M vs 8B) at 338 tokens/second; CoMD's 7B student surpasses its 13B teacher by +1.47% on ScienceQA, reaching 91.83%.	Unlock the Power (2023), From Images to Words: Efficient... (2026), Scaling Multimodal Search and Recommendation... (2025), CoT-Drive (2025)
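
The simplest form of the attention-score pruning described in the Progressive Visual Token Pruning row is a top-k selection. The sketch below is a generic single-stage version under stated assumptions: `prune_tokens` and the example scores are hypothetical, and METEOR's multi-stage criteria (information rank, task-adaptive retention) are not modelled.

```python
def prune_tokens(scores, keep_ratio):
    """Keep the highest-scoring tokens; return their indices in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical attention mass each visual token receives from the text query.
scores = [0.01, 0.02, 0.30, 0.25, 0.01, 0.03, 0.20, 0.18]

kept = prune_tokens(scores, keep_ratio=0.25)
print(kept)
```

Returning indices in their original order preserves the spatial ordering of the surviving tokens, which downstream positional encodings usually expect.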

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
ImageNet Classification (ViT PTQ)	Top-1 Accuracy	78.43%	APHQ-ViT (2025)
Waymo Open Dataset (LiDAR PTQ)	mAPH Level 2	60.12 mAPH	LiDAR-PTQ (2024)
MMMU (Multimodal Understanding)	Accuracy	77.7%	InternVL3.5 (2025)
EAGLE Multi-Encoder Benchmarks (Token Pruning)	Average Score across 11 benchmarks	76% token reduction with only 0.3% average accuracy drop	METEOR (2025)

⚠️ Known Limitations (4)

Ultra-low-bit quantization (3-bit or below) still causes significant accuracy drops on complex downstream tasks like detection and segmentation, even with specialized quantizers. (affects: Distribution-Aware Vision Transformer Quantization, Task-Specific Post-Training Quantization)
Potential fix: Combining quantization with low-rank compensation (AIQViT) or MLP reconstruction with activation substitution (APHQ-ViT) can partially mitigate ultra-low-bit degradation.
Token pruning methods rely on attention-based importance scores that may discard visually subtle but semantically critical tokens, particularly for fine-grained tasks like OCR or small object detection. (affects: Progressive Visual Token Pruning, Dynamic Visual Resolution Routing)
Potential fix: Task-adaptive retention strategies (like METEOR's Visual Attention Value for OCR) and frequency-based restoration (FMVR) can recover fine-grained details lost during aggressive pruning.
Knowledge distillation requires access to a powerful teacher model and significant compute for generating training data, creating a dependency bottleneck for resource-constrained teams. (affects: Cross-Modal Knowledge Distillation)
Potential fix: Black-box distillation methods like ARMADA that only need teacher outputs rather than weights, and synthetic data generation approaches, reduce the barrier to effective distillation.
Most PTQ methods are evaluated primarily on ImageNet classification; generalization to diverse downstream tasks (video, 3D, medical imaging) remains underexplored and inconsistent. (affects: Distribution-Aware Vision Transformer Quantization, Task-Specific Post-Training Quantization)
Potential fix: Task-guided supervision losses (LiDAR-PTQ) and temporal-aware calibration (PTQ4VM) demonstrate that incorporating task-specific priors during calibration improves cross-domain generalization.

💡 Within the same paradigm, another important research direction focuses on Multimodal Pretraining and Instruction Tuning.

🔄 Multimodal Pretraining and Instruction Tuning

What: Research on pretraining models to jointly understand multiple modalities (vision, language, 3D, audio) and fine-tuning them to follow complex multimodal instructions.

Why: Enabling AI systems to perceive and reason across diverse data types is essential for real-world applications from medical diagnosis to embodied interaction.

Baseline: Standard approach uses frozen pretrained encoders (e.g., CLIP) with simple linear probing or basic visual question-answering fine-tuning on paired data.

Aligning heterogeneous modality representations into a coherent shared embedding space without losing modality-specific information
Preventing capability degradation of the language backbone when fine-tuning on visual instruction data
Scaling to new domains and modalities with limited paired training data

🧪 Running Example

❓ A doctor uploads a retinal fundus photograph and asks: 'What abnormalities do you see, and should this patient be referred to a specialist?'

Baseline: A generic CLIP-based model can match the image to broad disease categories but cannot produce detailed clinical descriptions, misses subtle findings, and may hallucinate non-existent pathology due to weak vision-language alignment in medical domains.

Challenge: The retinal image requires domain-specific visual understanding (fine-grained lesion detection), language generation aligned with clinical terminology, and the model must avoid confident hallucinations that could mislead clinical decisions.

✅ Domain-Specialized Vision-Language Pretraining: EyeCLIP pretrains on ophthalmic image-text pairs with multi-modal alignment, enabling zero-shot recognition of rare diseases from retinal scans with AUROCs up to 0.757; VIVID-Med takes a related route, using LLM-supervised structured pretraining for medical ViTs.

✅ Multimodal Preference Optimization: DPO-based alignment using expert-judged response quality ensures the model produces detailed, helpful clinical descriptions rather than terse or hallucinated outputs

✅ Contrastive Multi-Modal Alignment: Expert-annotated CLIP (eCLIP) integrates radiologist eye-gaze heatmaps as attention signals, reducing the modality gap by 48% and improving cross-modal retrieval for clinical report matching

📈 Overall Progress

The field has evolved from extending CLIP to new modalities (3D, audio, medical) toward sophisticated alignment techniques that preserve language capabilities while improving visual understanding. A key paradigm shift occurred with the discovery of modality degradation and the adoption of preference optimization (DPO) as a lightweight fix. Theoretical foundations now explain why multi-modal learning inherently outperforms single-modal approaches through noise suppression, and practical deployment has been enabled through domain-specific pretraining that achieves competitive performance with orders of magnitude less data.

📂 Sub-topics

Contrastive Vision-Language Pretraining

12 papers

Methods extending CLIP-style contrastive learning to align multiple modalities including vision, language, audio, and 3D data, with innovations in temperature scheduling, multi-view alignment, and theoretical foundations explaining why multi-modal learning outperforms single-modal approaches.

CLIP-guided contrastive alignment Multi-modal temperature scheduling Tri-modal contrastive learning

Instruction Tuning and Preference Alignment

10 papers

Techniques for aligning multimodal LLMs with human preferences through instruction data curation, direct preference optimization, and competitive distillation to improve open-ended conversation quality while preventing language capability degradation.

Distillation-based DPO Competitive distillation Visual self-fulfilling alignment

3D-Language-Image Pretraining

6 papers

Approaches bridging the gap between 3D point cloud understanding and 2D vision-language models through proxy-based alignment, multi-view distillation, and adapter-based transfer learning for embodied interaction and shape retrieval.

Triplet proxy alignment Multi-view distillation Hard contrastive learning for shape retrieval

Domain-Specific Multimodal Pretraining

9 papers

Specialized pretraining frameworks for medical imaging, biosignals, urban computing, and earth observation that adapt general-purpose multi-modal learning to data-scarce, high-stakes domains using knowledge-enhanced and expert-guided strategies.

Knowledge-enhanced pretraining Expert-annotated contrastive learning Leave-one-out contrastive learning

Multimodal Architecture and Mixing Strategies

8 papers

Architectural innovations for combining multiple encoders, mixing model weights from different training domains, and enabling tool-augmented multimodal agents with continual learning capabilities.

Weight-task-embedding mixing Visual tool learning Model merging for continual updates

💡 Key Insights

💡 Multi-modal contrastive learning theoretically eliminates noise memorization that limits single-modal approaches

💡 Preference optimization with just 5K samples reverses language degradation from visual instruction tuning

💡 Domain-specialized pretraining with elite data matches models trained on 100x more private data

💡 Model weight merging outperforms LoRA and full fine-tuning for continual multimodal updates

📅 Timeline

Research has progressed from general-purpose contrastive pretraining toward domain-specialized, alignment-aware multimodal models with increasing emphasis on data efficiency, deployment readiness, and theoretical understanding of cross-modal learning dynamics.

2023-03 to 2023-11 Foundation building: Extending contrastive pretraining to 3D and multi-encoder architectures

CLIP2 (CLIP2, 2023) pioneered real-world 3D-language alignment using automatically mined proxy triplets, achieving +253% improvement over PointCLIP on outdoor recognition
(SPHINX, 2023) introduced three-fold mixing of weights, tasks, and visual embeddings to create versatile MLLMs with 90.8 POPE score
(LLaVA-Plus, 2023) established the paradigm of end-to-end visual tool learning with image-grounded planning, reaching 1203 Elo on VisIT-Bench
CoMD (Competitive Distillation for Multi-Modal LLMs, 2023) demonstrated that a 7B student model can surpass its 13B teacher through iterative competitive distillation

2024-02 to 2024-12 Scaling to domains and addressing alignment: preference optimization, domain-specific pretraining, and 3D multimodal LLMs

(Multi-modal Preference Alignment, 2024) first demonstrated that DPO with just 5K distilled samples reverses language degradation from visual instruction tuning
(ShapeLLM, 2024) created the first 3D multimodal LLM for embodied interaction with selective multi-view distillation
(EyeCLIP, 2024) achieved state-of-the-art zero-shot classification across 9 ophthalmic datasets by aligning multiple imaging modalities with clinical text
(SleepFM, 2024) introduced leave-one-out contrastive learning for physiological signals, outperforming supervised CNNs on sleep analysis (AUROC 0.88 vs 0.72)
FoMo-in-Flux (Practitioner's Guide to Continual Multimodal Pretraining, 2024) established model merging as the superior strategy for continual multimodal updates under realistic compute budgets
Signal-Noise Theory (On the Comparison between Multi-modal..., 2024) provided theoretical proof that multi-modal learning fundamentally suppresses noise memorization

🔀 Shift from purely contrastive pretraining to preference-based alignment (DPO) for multimodal models, addressing the newly discovered 'modality degradation' problem where visual fine-tuning harms language capabilities
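
For readers unfamiliar with the DPO objective mentioned above, the per-pair loss can be written in a few lines. This is a sketch of the standard DPO formula on scalar sequence log-probabilities; the argument names, the `beta` default, and the example numbers are illustrative assumptions, not values from any paper in this summary.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair; inputs are sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen response than the frozen reference model does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # same as -log(sigmoid(beta * margin))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(round(neutral, 4), round(improved, 4))
```

The loss needs only log-probabilities from the policy and a frozen reference model, with no separate reward model, which is part of why the summary describes it as a lightweight fix.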

2025-01 to 2026-03 Refined alignment at scale, deployment-ready specialized models, and theoretical advances in contrastive learning

(OmniAlign-V, 2025) created a 200K high-quality alignment dataset with semantic richness filtering, enabling a 32B model to outperform a proprietary 72B model
(VIVID-Med, 2026) used LLM-supervised structured pretraining to create deployable medical ViTs that outperform BiomedCLIP by +6.65 macro-AUC with 500x less data
(MM-TS, 2026) introduced density-aware temperature and margin scheduling for long-tail robustness in contrastive learning
CVS (Does the Question Really Matter?, 2026) proposed training-free data selection that outperforms full-data training by 4.8% using only 15% of samples
(Visual Self-Fulfilling Alignment, 2026) leveraged the self-fulfilling mechanism to align multimodal safety without explicit safety labels

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Contrastive Multi-Modal Alignment	Pull matching cross-modal pairs together while pushing non-matching pairs apart, using dynamic temperature and density-aware margin schedules to handle concept frequency imbalance.	Improves on standard fixed-temperature CLIP by +69.45% accuracy on ColoredMNIST (82.13% vs 12.68%) through multi-modal signal cooperation that suppresses spurious correlations	On the Comparison between Multi-modal... (2024), MM-TS (2026), Improving Medical Multi-modal Contrastive Learning... (2024), Turbo your multi-modal classification with... (2024)
Multimodal Preference Optimization	Use strong teacher models to generate preference pairs and apply DPO (Direct Preference Optimization) to align weaker models, reversing modality degradation with lightweight data.	Improves on LLaVA-1.5 baseline by +13.6 WildVision Score and +4.9% on MM-Vet, while surpassing the base language model Vicuna on text-only MT-Bench (6.73 vs 6.57)	OmniAlign-V (2025), Multi-modal Preference Alignment Remedies Degradation... (2024), Unlock the Power (2023), MM-Instruct (2024)
3D-Language-Image Pretraining	Align 3D point cloud encoders with frozen 2D vision-language models using automatically mined text-image-point triplets, enabling zero-shot 3D recognition without human labels.	Improves on PointCLIP by +253% relative accuracy on nuScenes zero-shot recognition (37.8% vs 11.7%); ShapeLLM-13B outperforms PointLLM by +5.1% on 3D MM-Vet benchmark	CLIP2 (2023), ShapeLLM (2024), TAMM (2024), MM-Point (2024)
Domain-Specialized Vision-Language Pretraining	Use small high-quality elite datasets or structured clinical knowledge as 'sparks' to guide pretraining on larger unlabeled collections, then optionally discard the teacher for lightweight deployment.	VIVID-Med outperforms BiomedCLIP by +6.65 macro-AUC points on CheXpert using 500x less data; EyeCLIP achieves 0.757 AUROC vs 0.654 for BiomedCLIP on diabetic retinopathy zero-shot classification	VIVID-Med (2026), EyeCLIP (2024), SleepFM (2024), MM-Retinal V2 (2025)
Multi-Modal Architecture Mixing	Mix model weights, visual embeddings from diverse encoders, and training tasks to create versatile multimodal LLMs that combine real-world and synthetic knowledge without domain conflict.	SPHINX achieves 90.8 POPE score surpassing LLaVA-1.5-13B (85.9) and InstructBLIP-13B (78.9); LLaVA-Plus reaches 1203 Elo on VisIT-Bench, outperforming base LLaVA (1095) by 108 points	SPHINX (2023), LLaVA-Plus (2023), A Practitioner's Guide to Continual... (2024)
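
The pull-together, push-apart objective in the Contrastive Multi-Modal Alignment row is usually implemented as a symmetric InfoNCE loss. The sketch below uses a fixed temperature and toy 2-D embeddings, both assumptions made for illustration; the dynamic temperature and density-aware margin schedules described above are not modelled.

```python
import math

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalised embeddings (pair i matches i)."""
    n = len(image_emb)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
           for img in image_emb]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])   # captions match their images
swapped = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])   # captions are mismatched
print(aligned, swapped)
```

A lower temperature sharpens the softmax and penalises hard negatives more strongly, which is the knob that temperature-scheduling methods adjust during training.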

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
nuScenes Zero-Shot Recognition	Accuracy	37.8%	CLIP2 (2023)
CheXpert Linear Probing	Macro-AUC	0.8588	VIVID-Med (2026)
ModelNet40 (Self-supervised 3D Classification)	Accuracy	92.4%	MM-Point (2024)
MM-AlignBench	Win Rate	28.5% win-rate	OmniAlign-V (2025)
VisIT-Bench	Elo Rating	1203 Elo	LLaVA-Plus (2023)

⚠️ Known Limitations (4)

Modality degradation: Visual instruction tuning significantly degrades the language backbone's original text capabilities, requiring careful alignment strategies to mitigate this inherent tension between visual and linguistic learning objectives (affects: Multimodal Preference Optimization, Multi-Modal Architecture Mixing)
Potential fix: Applying lightweight DPO with distilled preferences from stronger models, or using data filtering strategies like Conditional Verdict Shift (CVS) to select only samples requiring genuine visual reasoning
Data scarcity in specialized domains: Medical, 3D, and scientific domains have severely limited paired multi-modal data, constraining pretraining effectiveness and potentially introducing domain biases from small sample sizes (affects: Domain-Specialized Vision-Language Pretraining, 3D-Language-Image Pretraining)
Potential fix: Using elite knowledge sparks from small high-quality datasets, automatic proxy mining from unlabeled scans, and LLM-generated structured supervision to bootstrap pretraining without manual annotation
Demographic and social bias in pretrained encoders: CLIP and similar models encode demographic biases in specific attention heads, which silently propagate to all downstream applications built on these foundations (affects: Contrastive Multi-Modal Alignment)
Potential fix: Mechanistic fairness audits to identify and surgically ablate specific bias-encoding attention heads, reducing gender bias (Cramér's V from 0.381 to 0.362) while preserving or improving accuracy (+0.42%)
Hallucination and over-optimization: Models may hallucinate visual content not present in the image or over-optimize toward proxy rewards, causing quality degradation beyond certain reward thresholds (reward hacking) (affects: Multimodal Preference Optimization, Multi-Modal Architecture Mixing)
Potential fix: Regulated clipping with ratio normalization and gradient balancing to prevent reward hacking; visual self-fulfilling alignment that activates safety personas through exposure to threat-related imagery without explicit safety labels
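
The gender-bias figures above are reported as Cramér's V, an association measure derived from the chi-squared statistic of a contingency table. The sketch below shows the plain (uncorrected) formula on hypothetical counts; the tables are illustrative only and are not data from the cited audit.

```python
import math

def cramers_v(table: list[list[float]]) -> float:
    """Cramér's V for an r x c table of counts (no bias correction).

    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = perceived gender, columns = predicted label.
independent = [[50, 50], [50, 50]]   # prediction ignores gender
associated = [[90, 10], [10, 90]]    # prediction tracks gender closely
v_independent = cramers_v(independent)
v_associated = cramers_v(associated)
print(v_independent, v_associated)
```

V ranges from 0 (no association) to 1 (perfect association), so the reported drop from 0.381 to 0.362 is a modest reduction.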

💡 Moving to the next paradigm, we turn to Video Understanding.

📦

Video Understanding

What: Research on enabling models to reason about spatiotemporal relationships, long-term dependencies, and multimodal evidence within video content.

Why: Videos are the dominant medium for information consumption, demanding AI systems that comprehend dynamic visual scenes and temporal causality.

Baseline: Standard video-language models pair a visual encoder with a decoder-only language model, processing sampled frames without explicit temporal reasoning.

Temporal reasoning across long video sequences with complex causal and event dependencies
Bridging heterogeneous modalities—visual, linguistic, and wireless signals—for robust human-centric understanding

🧪 Running Example

❓ In this 10-minute cooking tutorial, what ingredient was added right before the sauce changed color, and why did that happen?

Baseline: A frame-sampling Video-LMM captions individual frames independently, missing the causal link between the ingredient addition and the color change because the two events are dozens of frames apart.

Challenge: Answering requires temporal localization (finding the exact moment of addition), causal reasoning (linking ingredient chemistry to color change), and long-context reasoning across a 10-minute video—exactly the challenges current models struggle with.

✅ Unified Video-LMM Post-Training Pipeline: SFT teaches the model structured reasoning formats for temporal localization, while GRPO reinforcement learning rewards correct temporal IoU, enabling the model to pinpoint the addition moment and chain causal steps.

✅ Verifier-Guided Iterative Policy Optimization (VerIPO): Iterates between exploration and curation to produce long, self-correcting chain-of-thought reasoning, allowing the model to explicitly reason through 'ingredient added → Maillard reaction → color change' rather than guessing.

✅ Cross-Layer Knowledge-Fusion MoE: Extracts multi-depth 'thought vectors' from the Video-LLM, capturing both low-level visual cues (color shift) and high-level semantic knowledge (cooking chemistry), enabling richer downstream understanding.
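
The temporal IoU reward mentioned for GRPO training can be sketched as the overlap between a predicted and a reference time interval. This is a generic illustration; the timestamps are hypothetical and the function is not taken from any of the cited papers.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical reference window for the ingredient-addition moment.
gold = (312.0, 318.0)
close_guess = temporal_iou((313.0, 319.0), gold)   # 5 s overlap over a 7 s union
far_guess = temporal_iou((100.0, 110.0), gold)     # disjoint intervals
print(close_guess, far_guess)
```

Used as a reward, a prediction that is off by one second still earns partial credit, while a guess in the wrong part of the video earns none.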

📈 Overall Progress

Video understanding has evolved from dataset construction (multi-modal sensing benchmarks) through RL-driven post-training of Video-LMMs (GRPO, DPO, test-time scaling) to knowledge extraction for practical applications. The field has seen a paradigm shift from purely supervised approaches to reinforcement learning pipelines that achieve comparable or superior reasoning with far less labeled data (roughly 27× less in the surveyed comparison). Concurrently, privacy-preserving sensing via wireless signals has matured from dataset creation to sophisticated graph-based pose estimation.

📂 Sub-topics

Video Reasoning with Large Multimodal Models

2 papers

Methods that enhance Video-LMMs with post-training techniques—supervised fine-tuning, reinforcement learning, and test-time scaling—to advance from basic perception to sophisticated temporal and causal reasoning.

Unified Video-LMM Post-Training Pipeline, Verifier-Guided Iterative Policy Optimization (VerIPO)

Video Knowledge Extraction for Downstream Applications

2 papers

Approaches that extract and repurpose the world knowledge embedded in Video-LLMs for practical applications such as video recommendation and question generation.

Cross-Layer Knowledge-Fusion MoE, Internal Knowledge Graph Question Generation

Non-Intrusive Multi-Modal Human Sensing

2 papers

Privacy-preserving approaches that use non-camera sensors (mmWave radar, WiFi, LiDAR) instead of cameras for 4D human pose estimation and action recognition.

Multi-Modal Non-Intrusive 4D Sensing, Graph Attention Radar Pose Estimation

💡 Key Insights

💡 RL-only video training matches large SFT systems with 27× less data

💡 Iterating GRPO-verifier-DPO produces long reasoning chains 7× faster

💡 Multi-modal wireless sensor fusion significantly outperforms single-modality sensing

💡 Graph attention preserving inter-point relationships reduces radar pose error (MPJPE) by 35.6%

💡 Multi-layer thought vectors retain visual details lost in text-based Video-LLM outputs

📅 Timeline

Research has progressed from building foundational multi-modal datasets (2023) to RL-powered video reasoning with LMMs (2025) and graph-based radar sensing (2026), reflecting a broader trend toward scalable, privacy-aware, and reasoning-capable video understanding systems.

2023-05 to 2023-12 Foundations in multi-modal sensing datasets

(MM-Fi, 2023) introduced the first five-modality synchronized dataset for non-intrusive 4D human sensing, establishing benchmarks for wireless pose estimation

2025-05 to 2025-12 Reinforcement learning and knowledge extraction for Video-LMMs

(VerIPO, 2025) demonstrated that iterating between GRPO, verifier curation, and DPO produces long reasoning chains 7× faster than standard GRPO
A comprehensive survey (A Survey of Video Reasoning..., 2025) unified video post-training into SFT, RL, and test-time scaling pillars, documenting that RL-only models can match large SFT-trained systems
(LinkedOut, 2025) proposed extracting multi-layer thought vectors from Video-LLMs for scalable recommendation
(INQUIRER, 2025) leveraged internal knowledge graphs to improve video question generation quality

🔀 Shift from pure supervised fine-tuning to RL-driven post-training (GRPO, DPO) for video reasoning, enabling models to develop long chain-of-thought capabilities with minimal labeled data.

2026-01 to 2026-03 Graph-based methods for privacy-preserving radar sensing

mmGAT (mmGAT: Pose Estimation by Graph..., 2026) applied graph attention networks with mutual edge features to radar point clouds, reducing pose estimation error by 35.6%

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Unified Video-LMM Post-Training Pipeline	Combines supervised fine-tuning, Group Relative Policy Optimization (GRPO), and test-time scaling into a unified post-training framework for Video-LMMs.	GRPO-based models (e.g., Video-RTS) match systems trained on ~165k SFT pairs using only ~6k video-QA triples; test-time scaling saturates after ~5 reasoning samples.	A Survey of Video Reasoning... (2025)
Verifier-Guided Iterative Policy Optimization	A three-stage loop—GRPO generates diverse rollouts, a rollout-aware verifier curates contrastive pairs, and DPO refines the policy toward longer, consistent reasoning.	Achieves 7× faster optimization than standard GRPO; outperforms Video-R1, Kimi-VL-Thinking, and Qwen2.5-VL-7B on VSI-Bench and Video-MME.	VerIPO (2025)
Cross-Layer Knowledge-Fusion MoE	Extracts hidden states from multiple layers of a Video-LLM backbone and uses a Mixture-of-Experts (MoE) router to dynamically select the most relevant abstraction level per video.	Replaces final-layer text output with multi-layer thought vectors, retaining fine-grained visual details lost in conventional text-based video representation.	LinkedOut (2025)
Multi-Modal Non-Intrusive 4D Human Sensing	Synchronizes five sensor modalities via a custom robotic platform, providing 4D spatial-temporal labels for 27 actions across 40 subjects.	Fusing LiDAR + mmWave significantly improves pose estimation over single wireless modalities; ground-truth achieves 95.66% PCKh@0.5 re-projection accuracy.	MM-Fi (2023)
Graph Attention Radar Pose Estimation	Models radar point clouds as directed graphs with a mutual-feature extraction block that computes pairwise attributes (velocity, distance) before graph attention.	Reduces Mean Per Joint Position Error (MPJPE) by 35.6% and PA-MPJPE by 14.1% over state-of-the-art on the mRI dataset.	mmGAT: Pose Estimation by Graph... (2026)
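
The GRPO step shared by the first two methods scores each rollout relative to the other rollouts sampled for the same prompt. Below is a minimal sketch of that normalization; the reward values are hypothetical, and the use of the population standard deviation is an assumption, since implementations differ. The verifier curation and DPO stages of VerIPO are not shown.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four rollouts answering the same video question.
rewards = [0.9, 0.1, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed: rollouts above the group average receive a positive advantage and those below it a negative one.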

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
mRI Dataset (Radar Pose Estimation)	Mean Per Joint Position Error (MPJPE)	35.6% reduction over prior state-of-the-art (absolute MPJPE not reported)	mmGAT: Pose Estimation by Graph... (2026)
Video-MME	Accuracy	Outperforms Video-R1, Kimi-VL-Thinking, and Qwen2.5-VL-7B (absolute score not reported)	VerIPO (2025)
MM-Fi Re-projection (Pose Ground Truth Quality)	PCKh@0.5 (Percentage of Correct Keypoints with head-normalized threshold)	95.66% PCKh@0.5	MM-Fi (2023)
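
The two pose metrics in this table can be sketched as follows: MPJPE averages the Euclidean distance between predicted and reference joints, and PCKh counts a keypoint as correct when its error is within a fraction of the head size. The joint coordinates and head size below are hypothetical toy values.

```python
import math

Joint = tuple[float, ...]

def mpjpe(pred: list[Joint], gold: list[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance per joint."""
    return sum(math.dist(p, g) for p, g in zip(pred, gold)) / len(gold)

def pckh(pred: list[Joint], gold: list[Joint], head_size: float, alpha: float = 0.5) -> float:
    """Fraction of keypoints whose error is at most alpha * head_size."""
    hits = sum(math.dist(p, g) <= alpha * head_size for p, g in zip(pred, gold))
    return hits / len(gold)

# Toy skeleton with three joints (units are arbitrary).
gold = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (0.0, 1.0, 0.0), (1.0, 1.5, 0.0)]
error = mpjpe(pred, gold)                  # (0.1 + 0.0 + 0.5) / 3
accuracy = pckh(pred, gold, head_size=0.4) # threshold 0.2, so 2 of 3 joints count
print(error, accuracy)
```

Lower is better for MPJPE and higher is better for PCKh, which is why the table reports one as a reduction and the other as a percentage.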

⚠️ Known Limitations (4)

Test-time scaling saturates quickly: performance gains plateau after approximately 5 reasoning samples during self-consistency voting, limiting the benefit of additional compute at inference. (affects: Unified Video-LMM Post-Training Pipeline)
Potential fix: More sophisticated aggregation strategies beyond majority voting, or adaptive sample budgets that allocate more compute only to hard examples.
RL training instability: GRPO-based reinforcement learning for video reasoning can produce unstable improvements in chain-of-thought quality, especially without careful reward design. (affects: Unified Video-LMM Post-Training Pipeline, Verifier-Guided Iterative Policy Optimization (VerIPO))
Potential fix: Verifier-based curation (as in VerIPO) to filter low-quality rollouts, or multi-stage pipelines combining SFT warmup with RL fine-tuning.
Deployment latency of Video-LLMs: Autoregressive, decoder-only generation and large model sizes make Video-LLMs impractical for latency-sensitive applications like recommendation systems. (affects: Cross-Layer Knowledge-Fusion MoE)
Potential fix: Extract compact hidden-state representations (thought vectors) offline and use lightweight downstream models for real-time inference.
Wireless sensing accuracy gap: Radar and WiFi-based pose estimation still lags behind camera-based methods in precision, particularly for fine-grained hand and finger movements. (affects: Multi-Modal Non-Intrusive 4D Human Sensing, Graph Attention Radar Pose Estimation)
Potential fix: Multi-modal fusion (e.g., LiDAR + mmWave) and graph-based architectures that preserve spatial relationships between radar points.
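
The self-consistency voting and adaptive sample budget discussed under the first limitation can be sketched as below. The stopping rule, the `agreement` threshold, and the cap of five samples (chosen to mirror the reported saturation point) are assumptions for illustration, not a method from the surveyed papers.

```python
from collections import Counter
from typing import Callable

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer (first seen wins ties)."""
    return Counter(answers).most_common(1)[0][0]

def adaptive_vote(sample: Callable[[], str], min_samples: int = 3,
                  max_samples: int = 5, agreement: float = 0.8) -> tuple[str, int]:
    """Draw samples until the leading answer is dominant or the budget is spent."""
    answers: list[str] = []
    while len(answers) < max_samples:
        answers.append(sample())
        lead_count = Counter(answers).most_common(1)[0][1]
        if len(answers) >= min_samples and lead_count / len(answers) >= agreement:
            break
    return majority_vote(answers), len(answers)

# Stand-ins for model samples: an easy question and a contested one.
easy = iter(["butter"] * 5)
hard = iter(["butter", "cream", "butter", "flour", "butter"])
easy_result = adaptive_vote(lambda: next(easy))
hard_result = adaptive_vote(lambda: next(hard))
print(easy_result)  # ('butter', 3)
print(hard_result)  # ('butter', 5)
```

The easy question stops after three agreeing samples, while the contested one uses the full budget, which is the intended effect of spending extra compute only on hard examples.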

💡 Diving deeper into Video Understanding, let's examine specific research threads that define this area.

🔍

Video QA and Captioning

What: Research on enabling AI models to answer natural-language questions about videos and generate grounded textual descriptions, requiring joint visual perception, temporal reasoning, and language generation.

Why: Videos dominate information consumption, yet AI models struggle with temporal dynamics, long-duration content, and grounding responses in specific visual evidence.

Baseline: Uniformly sample video frames, encode them with a frozen visual encoder, and concatenate visual tokens with text for a large language model to generate answers.

Processing long or streaming videos that exceed context windows while preserving temporal coherence across distant events
Achieving fine-grained spatiotemporal perception beyond surface-level recognition to detect subtle actions and rare moments
Grounding textual responses in verifiable visual evidence rather than hallucinating from language priors

🧪 Running Example

❓ In a 30-minute cooking tutorial, at what point did the chef accidentally over-salt the dish, and what corrective steps did they take?

Baseline: A standard Video LLM uniformly samples ~32 frames from 30 minutes, almost certainly missing the 2-second salting mistake. It generates a generic summary like 'The chef prepared a stew' without temporal grounding or causal reasoning about the error.
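
A quick calculation shows why uniform sampling is so likely to miss the event. The sketch assumes frames are taken at the centre of equal-length segments and that the 2-second event falls at a uniformly random position; edge effects at the start and end of the video are ignored.

```python
def uniform_frame_times(duration_s: float, num_frames: int) -> list[float]:
    """Timestamps at the centre of num_frames equal-length segments."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]

def hit_probability(duration_s: float, num_frames: int, event_s: float) -> float:
    """Approximate chance that at least one sampled frame lands inside the event."""
    step = duration_s / num_frames
    return min(1.0, event_s / step)

times = uniform_frame_times(30 * 60, 32)
gap = times[1] - times[0]
p_hit = hit_probability(30 * 60, 32, 2.0)
print(f"gap between frames: {gap:.2f} s, chance of catching the event: {p_hit:.1%}")
```

With 32 frames over 30 minutes, consecutive samples are about 56 seconds apart, so a 2-second event is captured only about 3.6% of the time under these assumptions.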

Challenge: This example requires: (1) long-video processing to scan 30 minutes efficiently, (2) fine-grained perception to notice the brief over-salting moment among routine actions, (3) temporal localization to pinpoint when it happened, and (4) causal reasoning to link the mistake to recovery steps.

✅ Bayesian Surprise Frame Sampling (SPIKE-RL): Tracks the model's belief about expected events and flags the over-salting as a surprising deviation from routine cooking, allocating more frames to that critical moment.

✅ Temporal Chain of Thought: Uses the VLM itself to iteratively identify relevant frame indices based on the question, narrowing from 30 minutes of footage to the critical seconds around the salt event.

✅ Deep Video Discovery Agent: Constructs a searchable database of the video and autonomously uses Browse/Search/Inspect tools to locate the salting incident and trace subsequent recovery actions.

✅ GRPO-CARE (Consistency-Aware RL): Reinforcement learning training ensures the model's reasoning chain logically connects the visual evidence of over-salting to the recovery actions, preventing shortcut answers.
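
The idea behind surprise-weighted frame sampling can be illustrated with a toy allocation: measure how far each segment moves the model's belief (here, KL divergence from a fixed prior) and spend the frame budget in proportion. This is a simplified sketch, not SPIKE-RL's actual procedure; the hypotheses, probabilities, and the fixed prior are all assumptions.

```python
import math

def kl_divergence(posterior: list[float], prior: list[float]) -> float:
    """KL(posterior || prior); assumes prior > 0 wherever posterior > 0."""
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0)

def allocate_frames(surprises: list[float], budget: int) -> list[int]:
    """Give each segment one frame, then split the rest in proportion to surprise."""
    n = len(surprises)
    spare = budget - n  # assumes budget >= number of segments
    total = sum(surprises)
    shares = [spare * s / total for s in surprises] if total > 0 else [spare / n] * n
    alloc = [1 + int(x) for x in shares]
    # Hand out frames lost to rounding, largest fractional share first.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Beliefs over (routine cooking, seasoning, mistake) for four video segments.
prior = [0.7, 0.2, 0.1]
posteriors = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
surprises = [kl_divergence(post, prior) for post in posteriors]
allocation = allocate_frames(surprises, budget=32)
print(allocation)
```

The third segment, where belief shifts sharply toward "mistake", receives most of the 32 frames, while routine segments keep a single frame each.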

📈 Overall Progress

The field has undergone a fundamental paradigm shift from task-specific video models (2023) through general-purpose video MLLMs with memory augmentation (2024) to RL-trained reasoning systems with explicit evidence grounding (2025-2026). The dominant training paradigm evolved from supervised fine-tuning to GRPO-based reinforcement learning, with consistency-aware and difficulty-aware variants addressing reward hacking. Architecturally, the field converged on token-efficient designs that reduce compute by 5-10x while enabling processing of multi-hour videos.

📂 Sub-topics

Reinforcement Learning for Video Reasoning

15 papers

Methods applying reinforcement learning — primarily Group Relative Policy Optimization (GRPO) and its variants — to improve video MLLMs' reasoning, perception, and captioning capabilities beyond what supervised fine-tuning achieves.

GRPO-CARE, EMA-GRPO, DE-GRPO, RLVR

Chain-of-Thought & Structured Video Reasoning

8 papers

Approaches that decompose video question answering into explicit multi-step reasoning chains, including visual chain-of-thought, tool-augmented reasoning, and interleaved video-text reasoning paradigms.

Temporal Chain of Thought, Video-of-Thought, ViTCoT, CoTasks

Long & Streaming Video Understanding

10 papers

Methods for processing videos ranging from minutes to hours (or continuous streams) by using memory banks, hierarchical representations, agentic search, and online processing to overcome context window limitations.

Memory-Augmented LMM, Flash-VStream, Think While Watching, VideoTree

Efficient Video-Language Architectures

8 papers

Architectural innovations that reduce the computational cost of video LLMs through token compression, codec-aware encoding, encoder-free designs, two-stream projectors, and parameter-space visual alignment.

AIM Token Merging, CoPE Codec Primitives, ViPE Parameter Alignment, SlowFast Projector

Video Understanding Benchmarks & Evaluation

6 papers

Benchmark datasets and evaluation frameworks that assess video MLLMs across temporal reasoning, chain-of-thought quality, long-video comprehension, and complex multi-step inference.

Video-MME, MVBench, CG-Bench, VCR-Bench

Domain-Specific Video Applications

12 papers

Specialized video QA and captioning systems tailored to specific domains including autonomous driving, accident analysis, egocentric multi-agent collaboration, advertisement understanding, spatial reasoning, and temporal grounding.

BEV-InMLLM, AdVersa-SD, TimeExpert, EgoMAS

💡 Key Insights

💡 Reinforcement learning with verifiable rewards outperforms supervised fine-tuning by 10-15% on video reasoning tasks.

💡 Visual perception, not logical reasoning, is the primary bottleneck in video chain-of-thought models.

💡 Token-efficient encoding reduces video LLM compute by 5-10x while maintaining or improving accuracy.

💡 Memory-augmented streaming enables constant-cost processing of arbitrarily long videos.

💡 Surprise-weighted frame sampling consistently outperforms uniform sampling across diverse benchmarks.

📅 Timeline

Research has shifted from 'how to encode video for LLMs' toward 'how to reason about video with verifiable evidence.' The 2025 RL revolution made GRPO the de facto post-training method, while agentic tool-use and streaming architectures extended practical video understanding from minutes to days.

2023-02 to 2023-11 Foundation architectures and early video benchmarks establishing the video MLLM paradigm

mPLUG-2 (mPLUG-2: A Modularized Multi-modal Foundation..., 2023) introduced modularized multi-modal pre-training with shared universal layers, achieving SOTA on MSRVTT Video QA (48.0%) and Captioning (80.3 CIDEr)
(MM-AU, 2023) introduced tone transition tracking as a formal task for understanding condensed ad narratives across 8.4K multilingual videos
(MVBench, 2023) established 20 temporal video understanding tasks with VideoChat2 baseline surpassing GPT-4V by 7.6%
(MM-VID, 2023) pioneered the video-to-script generation pipeline for processing hour-long content through GPT-4V
(MM-Narrator, 2023) introduced memory-augmented recurrent generation for audio descriptions spanning hours of video

🔀 Transition from task-specific video models to general-purpose multi-modal LLMs capable of open-ended video conversation.

2024-01 to 2024-12 Comprehensive evaluation frameworks, long-video processing methods, and preference alignment for video captioning

(Video-MME, 2024) created the first full-spectrum video evaluation benchmark across durations and modalities, becoming the de facto standard
(MA-LMM, 2024) introduced online memory-bank processing for constant-cost long video understanding, achieving 60.7% on LVU
(Video-of-Thought, 2024) bridged perception and cognition with scene-graph grounded reasoning chains
BEV-InMLLM (Holistic Autonomous Driving Understanding, 2024) injected Bird's-Eye-View features into MLLMs for holistic autonomous driving understanding across 91K multi-view QA pairs
(AIM, 2024) achieved 6.8x FLOPs reduction via training-free token merging and PageRank-based pruning
LLaVA-Hound-DPO (Aligning Large Multimodal Models with..., 2024) demonstrated caption-based proxy rewards for scalable video RLHF at <$20 for 120K pairs

2025-01 to 2026-03 RL revolution: GRPO-based training, agentic video search, and token-efficient architectures for practical deployment

(GRPO-CARE, 2025) introduced consistency-aware rewards that improved reasoning quality by +24.5% over standard GRPO
V-JEPA 2 (V-JEPA 2, 2025) scaled self-supervised latent video prediction to 1M+ hours, achieving 77.3% on Something-Something v2 and enabling robotic planning
Open-o3-Video (Open-o3-Video, 2025) introduced curriculum RL for joint spatio-temporal grounding with explicit evidence tags
(Deep Video Discovery, 2025) reframed video understanding as iterative agentic search, achieving a state-of-the-art 74.2% on LVBench
Seed1.5-VL (Seed1.5-VL Technical Report, 2025) achieved SOTA on 38 of 60 public benchmarks using dynamic frame-resolution sampling and hybrid RL
(CoPE-VideoLM, 2026) leveraged video codec structure to enable 8 hours of video within 1M tokens at 86.2% TTFT reduction
(Video-Based Reward Modeling for Computer-Use Agents, 2026) treated agent evaluation as video understanding with spatiotemporal token pruning, achieving 84.7% accuracy surpassing GPT-5.2
(Think While Watching, 2026) decoupled visual input from text output for concurrent streaming perception and generation with 92.6% latency reduction

🔀 Shift from supervised fine-tuning to reinforcement learning with verifiable rewards as the dominant post-training paradigm for video MLLMs, with GRPO variants appearing in over 15 papers.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
GRPO-Based Video Reinforcement Learning	Train video models with group-relative reward signals, comparing several sampled outputs for the same prompt, to learn robust reasoning without expensive human annotations.	Improves on standard GRPO by +6.7% accuracy on SEED-Bench-R1 Level-3 (GRPO-CARE), and on SFT baselines by +15.6% UAR on DFEW (R1-Omni), achieving state-of-the-art video reasoning.	GRPO-CARE (2025), Open-o3-Video (2025), Video-STR (2025), R1-Omni (2025), Rethinking Chain-of-Thought Reasoning for Videos (2025)
Chain-of-Thought Video Reasoning	Treat selected video frames as 'visual thoughts' analogous to textual chain-of-thought, curating visual context iteratively before generating the final answer.	Temporal CoT improves on standard inference by +11.4 points on LVBench (avg 68-min videos) using the same 32K token budget, achieving state-of-the-art on 4 benchmarks.	Temporal Chain of Thought: Long-Video... (2025), Video-of-Thought (2024), Video-CoT (2025), Thinking With Videos (2025)
Memory-Augmented Long Video Understanding	Decouple video perception from language generation using persistent memory banks that compress, store, and retrieve temporal context on demand.	MA-LMM improves on S5 baseline by +3.8% on LVU benchmark, achieving 60.7% accuracy. Think While Watching reduces time-to-first-token by 92.6% while matching offline accuracy.	MA-LMM (2024), Think While Watching (2026), Deep Video Discovery (2025), Ego-R1 (2025)
Token-Efficient Video Encoding	Exploit the massive redundancy in video frames by merging similar tokens, encoding only visual changes, or transforming video features into lightweight weight updates.	AIM reduces FLOPs by 6.8x over LLaVA-OV-7B with +4.6 points on MLVU when using efficiency gains for more frames. CoPE reduces time-to-first-token by 86.2% over LLaVA-Video-7B.	AIM (2024), CoPE-VideoLM (2026), ViPE (2025), SlowFast-LLaVA-1.5 (2025)
Agentic Video Search & Tool Use	Empower an LLM agent with modular video tools and train it via reinforcement learning to plan optimal tool-use sequences for complex queries.	DVD achieves 74.2% accuracy on LVBench, setting a new state-of-the-art and surpassing all prior works by a large margin. VITAL improves by +11.4% on LongVideo-Reason over the previous best open-source model.	Deep Video Discovery (2025), Ego-R1 (2025), Thinking With Videos (2025), RAVEN (2025)
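
The redundancy that token-efficient encoders exploit can be illustrated with a deliberately simple merge rule: fold a token into the previously kept one whenever the two are near-duplicates by cosine similarity. This is a toy sketch, not the algorithm used by AIM or CoPE; the threshold and the token vectors are hypothetical.

```python
import math

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_similar_tokens(tokens: list[Vector], threshold: float = 0.95) -> list[Vector]:
    """Average each token into the last kept token when they are near-duplicates."""
    merged: list[Vector] = []
    counts: list[int] = []
    for tok in tokens:
        if merged and cosine(merged[-1], tok) >= threshold:
            n = counts[-1]
            # Running mean keeps the merged token representative of its group.
            merged[-1] = [(m * n + t) / (n + 1) for m, t in zip(merged[-1], tok)]
            counts[-1] = n + 1
        else:
            merged.append(list(tok))
            counts.append(1)
    return merged

# Three nearly identical tokens followed by two from a changed scene.
tokens = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.01], [0.0, 1.0], [0.02, 1.0]]
compact = merge_similar_tokens(tokens)
print(len(tokens), "->", len(compact))  # 5 -> 2
```

The saved tokens can then be spent on additional frames, which is how the table's efficiency gains translate into accuracy gains.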

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
Video-MME	Accuracy (%)	81.3%	Video-MME (2024)
MVBench	Average Accuracy (%)	51.1%	MVBench (2023)
LVBench	Accuracy (%)	74.2%	Deep Video Discovery (2025)
QVHighlights	mAP (IoU=0.5)	+2.8% mAP over TRACE	TimeExpert (2025)
SEED-Bench-R1	Accuracy (%)	+6.7% on Level-3 over standard GRPO	GRPO-CARE (2025)

⚠️ Known Limitations (4)

RL training instability and high compute cost: GRPO variants require careful reward design and significant GPU resources, with reward hacking remaining a persistent risk where models find shortcut solutions. (affects: GRPO-Based Video Reinforcement Learning, Chain-of-Thought Video Reasoning)
Potential fix: GRPO-CARE's consistency-aware rewards and FaVChat's data-efficient DE-GRPO demonstrate that adaptive reward mechanisms and sample utility estimation can mitigate instability and reduce data requirements.
Hallucination in video descriptions: Models generate plausible but fabricated details not present in the video, with faithfulness scores as low as 34% before mitigation, due to over-reliance on language priors. (affects: Chain-of-Thought Video Reasoning, Memory-Augmented Long Video Understanding)
Potential fix: Dynamic ad-hoc RAG for cross-verification (ResNetVLLM-2) and caption-based proxy rewards for DPO alignment (LLaVA-Hound-DPO) improve faithfulness from 34% to 98% in controlled settings.
Context window limits for very long videos: Even with compression, multi-hour or multi-day videos exceed model capacity, and aggressive compression risks losing rare but critical events. (affects: Memory-Augmented Long Video Understanding, Token-Efficient Video Encoding)
Potential fix: Codec-primitive encoding (CoPE) and hierarchical RAG with tool-based retrieval (Ego-R1) extend coverage to 8 hours and full weeks respectively, though reliability at scale remains unproven.
Benchmark saturation and evaluation gaps: Models achieve high accuracy on standard MCQ benchmarks through text-based elimination without genuine visual understanding, as shown by significant credibility gaps when grounding is required. (affects: GRPO-Based Video Reinforcement Learning, Chain-of-Thought Video Reasoning)
Potential fix: CG-Bench's clue-grounded evaluation and VCR-Bench's stepwise process scoring offer more rigorous evaluation, but widespread adoption of process-centric metrics is needed.
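
The context-window limitation becomes concrete when the token budget is divided by the video length. The sketch below uses the 1M-token, 8-hour figure quoted for CoPE-VideoLM; the week-long case is a hypothetical extrapolation with the same budget, not a reported result.

```python
def tokens_per_second(token_budget: int, video_hours: float) -> float:
    """Average number of tokens available per second of video."""
    return token_budget / (video_hours * 3600)

eight_hours = tokens_per_second(1_000_000, 8)
one_week = tokens_per_second(1_000_000, 24 * 7)  # hypothetical extrapolation
print(round(eight_hours, 1), round(one_week, 2))  # 34.7 1.65
```

At under two tokens per second, a week of footage cannot be held in context directly, which is why retrieval-based approaches are proposed for that regime.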

📚 View major papers in this topic (10)

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06) 9
Video-Based Reward Modeling for Computer-Use Agents (2026-03) 9
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning (2025-06) 8
Seed1.5-VL Technical Report (2025-05) 8
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding (2025-05) 8
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2023-11) 8
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (2024-04) 8
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (2023-02) 8
Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence (2025-10) 8

💡 Within the same paradigm, another important research direction focuses on Temporal Reasoning and Action.

📋

Temporal Reasoning and Action

What: Research on enabling multimodal models to understand temporal dynamics in video, including event ordering, moment localization, action recognition, and causal reasoning over time.

Why: Accurate temporal reasoning is essential for video-based AI applications like embodied agents, autonomous navigation, and interactive video assistants.

Baseline: Standard Video LLMs uniformly sample frames and use next-token prediction, treating video as a bag of static images without explicit temporal modeling.

Models exploit static frame shortcuts instead of reasoning about event progression and temporal ordering
Existing benchmarks contain noisy annotations and ambiguous queries that mask true temporal understanding gaps
Long videos overwhelm context windows, causing models to miss brief or rare temporal events

🧪 Running Example

❓ In a 10-minute cooking tutorial video, find the exact moment when the chef adds salt to the boiling pasta and describe what happens immediately after.

Baseline: A standard Video LLM uniformly samples 32 frames from the 10-minute video, likely missing the 2-second salt-adding moment entirely. It guesses 'the chef adds salt around the middle' based on common cooking scripts rather than visual evidence.

Challenge: This example requires temporal localization (pinpointing a 2-second window among 10 minutes), fine-grained perception (distinguishing the brief salt-adding gesture from similar hand movements), and causal reasoning (understanding what follows — stirring the pasta — as a consequence).

✅ Temporal Reinforcement Learning (T-GRPO): Training with contrastive temporal rewards, which pay off only when answers are more accurate on ordered frames than on shuffled ones, teaches the model to attend to the specific temporal ordering of cooking steps rather than guessing from static scene appearance.

✅ Visual Chain-of-Thought Reasoning: Decomposes the query into steps: first identify frames showing the stove area, then narrow to frames with hand-over-pot gestures, finally locate the salt-adding action — each step curating the visual context for focused reasoning.

✅ Factorized Temporal Grounding: Separates the task into 'find the evidence' (temporal grounding to locate the salt-adding timestamp) and 'describe the event' (generate text about what follows), ensuring the model grounds its answer in specific video moments.

✅ Tool-Augmented Video Reasoning: The model actively clips the relevant 30-second segment around the detected cooking action, re-examines it at higher temporal resolution, and then reasons about the causal sequence with focused visual evidence.
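
The contrastive temporal reward behind T-GRPO can be sketched as a comparison between accuracy on ordered and on shuffled frames. The bonus size `alpha` and the base reward of 1.0 are assumptions for illustration; the source only states that ordered and shuffled accuracy are compared.

```python
def temporal_bonus(ordered_correct: list[bool], shuffled_correct: list[bool],
                   alpha: float = 0.3) -> float:
    """Return a bonus only if ordered frames yield higher accuracy than shuffled ones."""
    p_ordered = sum(ordered_correct) / len(ordered_correct)
    p_shuffled = sum(shuffled_correct) / len(shuffled_correct)
    return alpha if p_ordered > p_shuffled else 0.0

def rollout_rewards(ordered_correct: list[bool], shuffled_correct: list[bool],
                    alpha: float = 0.3) -> list[float]:
    """Correct ordered-frame rollouts get 1.0 plus the temporal bonus; others get 0.0."""
    bonus = temporal_bonus(ordered_correct, shuffled_correct, alpha)
    return [1.0 + bonus if ok else 0.0 for ok in ordered_correct]

# Hypothetical rollout outcomes for the salt-adding question.
uses_order = rollout_rewards([True, True, False, True], [True, False, False, False])
static_shortcut = rollout_rewards([True, True, False, True], [True, True, False, True])
print(uses_order)       # [1.3, 1.3, 0.0, 1.3]
print(static_shortcut)  # [1.0, 1.0, 0.0, 1.0]
```

When shuffling the frames does not hurt accuracy, the model is presumably answering from static appearance, so no bonus is paid.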

📈 Overall Progress

The field progressed from static benchmark evaluation (2023) through structured reasoning frameworks (2024) to a reinforcement learning revolution (2025) where temporal-aware reward signals became the dominant paradigm. The key paradigm shift was recognizing that standard next-token prediction fundamentally fails to capture temporal dynamics, leading to contrastive RL methods that explicitly penalize static frame exploitation. Concurrently, the community addressed data quality issues, revealing that 20-35% of popular benchmark annotations are flawed.

📂 Sub-topics

Reinforcement Learning for Temporal Reasoning

6 papers

Methods that modify reinforcement learning algorithms (especially GRPO) with temporal-aware rewards to train video models that genuinely understand event progression rather than exploiting static visual shortcuts.

T-GRPO, Graph-based RLVR, Difficulty-aware GRPO, Curriculum Spatio-Temporal RL

Chain-of-Thought Video Reasoning

5 papers

Approaches that decompose video question-answering into structured multi-step reasoning processes, using intermediate visual or symbolic representations to ground each reasoning step in specific video evidence.

Temporal Chain of Thought, Video-CoT, CoTasks, Scene Graph Grounding

Temporal Grounding and Localization

3 papers

Specialized architectures and training recipes for precisely locating event boundaries in videos, including moment retrieval, highlight detection, and dense video captioning with timestamps.

D2VLM, TimeLens, TimeExpert

Egocentric Activity Understanding

4 papers

Research on understanding activities from first-person viewpoints, requiring inference of the camera wearer's hidden intentions, hand-object interactions, and spatial navigation through dynamic environments.

Reverse Thinking, EgoThinker, MVExoNet, Spatiotemporal MIRC

Benchmarks, Datasets, and Foundation Models

7 papers

Evaluation frameworks, large-scale datasets, and unified foundation models that establish standards and baselines for measuring temporal video understanding capabilities.

MVBench, CG-Bench, Seed-ViT, EMA-GRPO

💡 Key Insights

💡 Contrastive temporal rewards prevent static frame shortcut exploitation in video reasoning

💡 Selective frame curation with 32K tokens outperforms 700K-token brute-force processing

💡 Popular temporal benchmarks contain 20-35% flawed annotations, distorting evaluations

💡 Decoupling temporal localization from text generation yields 20%+ grounding improvements

💡 Tool-augmented active video clipping reduces hallucination in long-video understanding

📅 Timeline

Research evolved from building temporal benchmarks and datasets (2023-2024) to developing RL-based training paradigms that enforce genuine temporal understanding (2025), with an increasing emphasis on grounded, verifiable reasoning with explicit spatio-temporal evidence.

2023-04 to 2023-11 Foundation benchmarks and egocentric data collection

AssemblyHands (CVPR 2023) established the largest egocentric 3D hand pose benchmark with 3M annotated images using multi-view annotation to overcome occlusion challenges
MVBench (arXiv 2023-11, CVPR 2024) introduced 20 systematic temporal video understanding tasks, revealing that existing MLLMs including GPT-4V scored below 50% on temporal reasoning

2024-01 to 2024-11 Benchmark refinement and early structured reasoning

CG-Bench exposed the 'credibility gap' in long-video benchmarks, showing model accuracy drops from ~53% to ~21% when requiring clue-grounded evidence rather than multiple-choice elimination
Video-of-Thought (ICML 2024) pioneered step-by-step video reasoning using spatial-temporal scene graphs as intermediate rationales, bridging perception and cognition
MM-WLAuslan (NeurIPS 2024) curated the first large-scale Australian Sign Language dataset with 282K+ multi-view videos for temporal action recognition

2025-03 to 2025-12 Reinforcement learning revolution for temporal video understanding

Video-R1 (2025-03) pioneered Temporal GRPO with contrastive temporal rewards, establishing the first systematic RL approach for video temporal reasoning
TEMPLE (2025-03) reversed the standard training order by applying preference learning before instruction tuning to establish fundamental temporal alignment
Temporal Chain of Thought (2025-07) demonstrated that self-reflective frame selection with 32K tokens outperforms brute-force 700K-token context windows
VITAL (2025-08) introduced tool-augmented reasoning where models actively clip and re-examine video segments during their reasoning chain
Video-STR (2025-10) extended RL with graph-based verifiable rewards for precise spatio-temporal object relation modeling
D2VLM (2025-11) factorized temporal grounding into evidence finding and text generation stages, achieving +21.6% F1 improvement on grounding benchmarks
TimeLens (2025-12) exposed 20-35% annotation quality issues in popular temporal grounding benchmarks and proposed curated re-annotation with RLVR training

🔀 The field shifted from supervised fine-tuning to RL-based temporal reasoning, with T-GRPO and its variants becoming the dominant training paradigm for enforcing genuine temporal understanding in video models.

2026-03 Human-AI comparative analysis of temporal robustness

Human-AI (2026-03) revealed that AI models degrade more gradually than humans on spatial reduction but show class-dependent sensitivity to temporal scrambling, establishing new metrics for measuring temporal robustness gaps

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Temporal Reinforcement Learning	Contrastive temporal rewards compare model accuracy on ordered versus shuffled frames, penalizing reliance on static visual content.	Video-R1 achieves 37.1% accuracy on VSI-Bench, outperforming GPT-4o; Video-STR improves on Qwen2.5-VL-7B by +13% on STI-Bench, surpassing GPT-4o on spatio-temporal reasoning	Video-R1 (2025), Video-STR (2025), Open-o3-Video (2025), VideoPerceiver (2025)
Visual Chain-of-Thought Reasoning	Selected video frames serve as visual thoughts, enabling focused reasoning on relevant evidence rather than processing entire videos.	Temporal CoT improves by +11.4 points on LVBench (avg 68-min videos) vs standard inference with the same 32K token budget; CoTasks achieves +34.3% accuracy on STAR benchmark for Qwen2.5-VL-3B	Video-of-Thought (2024), Temporal Chain of Thought: Long-Video... (2025), Video-CoT (2025), SG-VLM (2025)
Factorized Temporal Grounding	Decoupling temporal boundary prediction from text generation allows each subtask to be optimized independently with task-specific mechanisms.	D2VLM achieves +21.6% average F1 on E.T. Bench Grounding over E.T.Chat-3.8B (60.2% F1); TimeExpert achieves +2.8% mAP (IoU=0.5) on QVHighlights over TRACE; TimeLens-8B surpasses GPT-5 on TimeLens-Bench	TimeExpert (2025), Factorized Learning for Temporally Grounded... (2025), TimeLens (2025)
Egocentric Spatio-Temporal Reasoning	Reverse thinking — mentally retracing a route backwards — mimics human cognitive processes for spatial recall from egocentric perspectives.	EgoThinker achieves state-of-the-art on EgoTimeQA and Ego-QA benchmarks; AssemblyHands MVExoNet achieves 4.20mm keypoint error, an 85% error reduction from Assembly101's 27.55mm	AssemblyHands (2023), ST-Think (2025), EgoThinker (2025)
Tool-Augmented Video Reasoning	Models 'think with videos' by iteratively clipping and re-examining relevant segments, enabling active visual evidence gathering during reasoning.	VITAL achieves +11.4% accuracy on LongVideo-Reason (79.3% vs 67.9% previous best open-source); +7.3% Recall@1 on VidChapters-7M temporal grounding (34.7% vs 27.4%)	Thinking With Videos (2025)
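
The contrastive temporal reward in the first row of the table can be sketched as below. This is a minimal illustration under stated assumptions, not the Video-R1 implementation: the function name `temporal_bonus`, the bonus size `alpha`, and the `margin` threshold are all hypothetical choices made for the example. The idea it shows is only the one described above: correct answers earn a bonus when the group does better on ordered frames than on shuffled frames.

```python
from typing import List


def temporal_bonus(
    ordered_correct: List[bool],
    shuffled_correct: List[bool],
    alpha: float = 0.3,    # assumed bonus size, not taken from the source
    margin: float = 0.0,   # assumed gap required between the two accuracies
) -> List[float]:
    """Return a per-response bonus for the ordered-frame rollouts.

    The bonus is granted only when the group is more accurate on ordered
    frames than on shuffled frames, so answers that could be produced from
    static frame content alone earn nothing extra.
    """
    p_ordered = sum(ordered_correct) / len(ordered_correct)
    p_shuffled = sum(shuffled_correct) / len(shuffled_correct)
    order_matters = p_ordered > p_shuffled + margin
    # Only correct responses are rewarded, and only when frame order helped.
    return [alpha if (order_matters and ok) else 0.0 for ok in ordered_correct]


if __name__ == "__main__":
    # Ordered accuracy 0.75 vs shuffled accuracy 0.25: the bonus is paid.
    print(temporal_bonus([True, True, False, True], [True, False, False, False]))
    # Equal accuracy: frame order made no difference, so no bonus.
    print(temporal_bonus([True, False], [True, False]))
```

In a full training loop this bonus would be added to the task reward before advantages are computed; that wiring is omitted here.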

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
STI-Bench	Accuracy	Surpasses GPT-4o (exact score not reported)	Video-STR (2025)
E.T. Bench Grounding	Average F1	60.2% F1	Factorized Learning for Temporally Grounded... (2025)
QVHighlights	mAP (IoU=0.5)	+2.8% mAP over TRACE (absolute score not specified)	TimeExpert (2025)
LVBench	Accuracy	+11.4 points over standard inference baseline	Temporal Chain of Thought: Long-Video... (2025)
NExT-QA (Temporal/Causal)	Accuracy	+23.6% over ViperGPT baseline with InternVL-14B	SG-VLM (2025)
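
The grounding metrics cited in this section (mAP at IoU=0.5, Recall@1, F1) all build on the temporal IoU between a predicted and an annotated segment. The sketch below shows that building block and a simple Recall@1 at a fixed threshold. The function names and the one-prediction-per-query setup are assumptions for illustration; individual benchmarks define their own protocols.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], thresh: float = 0.5) -> float:
    """Fraction of queries whose single top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))   # overlap 5 s over a 15 s union
    print(recall_at_1([(0.0, 10.0), (0.0, 4.0)], [(2.0, 10.0), (6.0, 9.0)]))
```

Because a small timestamp error can move a prediction across the threshold, inaccurate annotations directly distort these scores.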

⚠️ Known Limitations (4)

Static frame exploitation — models can achieve high accuracy on many temporal benchmarks by reasoning from individual frames rather than understanding event progression, undermining the validity of temporal evaluations (affects: Temporal Reinforcement Learning (T-GRPO), Visual Chain-of-Thought Reasoning)
Potential fix: Contrastive temporal rewards (T-GRPO) and temporal preference alignment (TEMPLE) explicitly penalize frame-order-invariant answers, but require careful reward calibration
Benchmark annotation quality — 20-35% of samples in popular temporal grounding benchmarks have ambiguous queries or inaccurate timestamps, causing misleading model comparisons and rewarding shortcut learning (affects: Factorized Temporal Grounding, Temporal Reinforcement Learning (T-GRPO))
Potential fix: Manual re-annotation (TimeLens-Bench) and clue-grounded evaluation (CG-Bench) provide higher-quality assessments but are expensive to scale
Long video scalability — context window limitations force uniform frame sampling that misses brief or rare events in videos exceeding 10 minutes, with performance degrading significantly on hour-long content (affects: Visual Chain-of-Thought Reasoning, Tool-Augmented Video Reasoning)
Potential fix: Dynamic segment processing (Temporal CoT) and tool-augmented clipping (VITAL) decouple video length from context limits, but add inference-time computation
Egocentric domain gap — first-person video understanding requires inferring unobservable agent intentions and handling severe hand-object occlusions, for which standard third-person training data provides inadequate supervision (affects: Egocentric Spatio-Temporal Reasoning)
Potential fix: Large-scale egocentric datasets (EgoRe-5M) and multi-view exocentric annotation pipelines (AssemblyHands) help bridge the gap, but collecting diverse egocentric data remains challenging
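
The long-video limitation above contrasts uniform sampling with selective frame curation under a fixed context budget. A minimal sketch of budgeted selection follows. It assumes relevance scores are already available (in practice they would come from the model itself) and that every frame costs the same number of tokens; `select_frames` and `tokens_per_frame` are hypothetical names, and this is not the Temporal Chain of Thought procedure itself.

```python
from typing import List


def select_frames(relevance: List[float], token_budget: int, tokens_per_frame: int) -> List[int]:
    """Pick the most relevant frames that fit the budget, in temporal order."""
    max_frames = token_budget // tokens_per_frame
    # Rank frame indices by relevance, highest first.
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    # Restore temporal order so the model still sees events in sequence.
    return sorted(ranked[:max_frames])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2]
    print(select_frames(scores, token_budget=512, tokens_per_frame=256))
```

With an assumed cost of 256 tokens per frame, a 32K-token budget admits 125 frames regardless of video length, which is why selection quality matters more than raw context size.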

📚 View major papers in this topic (10)

Video-R1: Reinforcing Video Reasoning in MLLMs (2025-03) 8
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames (2025-07) 8
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs (2025-12) 8
Factorized Learning for Temporally Grounded Video-Language Models (2025-11) 8
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph (2025-10) 8
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning (2025-08) 8
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2023-11) 8
CG-Bench: Clue-Grounded Question Answering Benchmark for Long Video Understanding (2024-01) 8
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation (2023-04) 8
MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset (2024-10) 9

💡 Moving to the next paradigm, we turn to Embodied AI and Robotics.

🔧

Embodied AI and Robotics

What: Research on AI systems that perceive, reason, and act in physical or simulated environments through vision, language, and motor control.

Why: Enabling robots and agents to autonomously perform complex real-world tasks requires closing the loop between perception, reasoning, and physical action.

Baseline: Traditional systems decouple perception, planning, and control into separate hand-engineered modules with task-specific training on each component.

Bridging the sim-to-real gap: policies trained in simulation degrade under real-world noise, dynamics, and visual diversity
Long-horizon reasoning under partial observability: agents must plan over extended sequences with incomplete sensory information
Scaling robot learning: collecting diverse, high-quality demonstration data is expensive and limits generalization

🧪 Running Example

❓ A household robot must find a specific medicine bottle in an upstairs bathroom and deliver it to a person sitting in the living room.

Baseline: A traditional modular pipeline would use a pre-built map for navigation, a fixed object detector for the bottle, and a scripted grasp routine. It would fail if the map is outdated, the bottle looks different from the training examples, or obstacles block the planned path.

Challenge: This task requires multi-floor navigation (long-horizon planning), recognizing an object from a language description under visual clutter (vision-language grounding), adapting to unexpected obstacles like a closed door (dynamic replanning), and safely grasping a small object (dexterous manipulation).

✅ End-to-End World-Model RL: The robot learns a latent world model from raw camera images, enabling it to imagine future trajectories and navigate stairs without hand-crafted maps, as demonstrated by Dream to Fly's 100% simulation success rate.

✅ Self-Improving Robotic Foundation Models: After initial imitation learning, the robot autonomously practices grasping the bottle using self-generated reward signals (steps-to-go prediction), improving success from 45% to 75% without additional human demonstrations.

✅ Dual-System Vision-Language Navigation: A slow VLM planner identifies the bathroom as a mid-term goal from the instruction, while a fast diffusion policy controller executes smooth motion at 30Hz, avoiding the high-latency jerky movements of end-to-end approaches.

✅ Object-Centric 3D Scene Graphs: ConceptGraphs builds a semantic map of the house where each object is a node with language-grounded descriptions, allowing the robot to resolve the query 'find the medicine bottle' through LLM reasoning over the graph rather than exhaustive search.
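
The dual-system idea in the list above, a slow planner that sets mid-term goals while a fast controller runs every tick, can be sketched with a toy 1-D robot. Everything here is a stand-in: `slow_planner` replaces the VLM, `fast_controller` replaces the diffusion policy, and the 30-tick replanning interval is an assumption chosen so that the fast loop corresponds to 30 Hz and the slow loop to roughly 1 Hz.

```python
from typing import List


def slow_planner(position: float, final_goal: float, horizon: float = 2.0) -> float:
    """Stand-in for the VLM planner: pick a mid-term waypoint toward the goal."""
    step = max(-horizon, min(horizon, final_goal - position))
    return position + step


def fast_controller(position: float, waypoint: float, max_step: float = 0.1) -> float:
    """Stand-in for the fast policy: take one bounded step toward the waypoint."""
    delta = max(-max_step, min(max_step, waypoint - position))
    return position + delta


def run(final_goal: float, ticks: int, plan_every: int = 30) -> List[float]:
    position, waypoint, trace = 0.0, 0.0, []
    for t in range(ticks):
        if t % plan_every == 0:
            # Slow loop: replan only occasionally (about 1 Hz at 30 ticks/s).
            waypoint = slow_planner(position, final_goal)
        # Fast loop: act on every tick using the most recent waypoint.
        position = fast_controller(position, waypoint)
        trace.append(position)
    return trace


if __name__ == "__main__":
    trace = run(final_goal=5.0, ticks=90)
    print(round(trace[-1], 3))
```

The controller never waits for the planner: between replans it keeps acting on a slightly stale waypoint, which is the trade-off that buys low control latency.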

📈 Overall Progress

The field has evolved from modular perception-planning-control pipelines to end-to-end foundation models that learn directly from raw sensory inputs. A critical paradigm shift occurred with the introduction of self-improving training loops, where robots generate their own reward signals for autonomous practice, reducing dependence on expensive human demonstrations. Most recently, the community has shifted focus toward safety-critical evaluation, revealing that even frontier MLLMs suffer from 'causal blindness' when assessing physical consequences in embodied settings.

📂 Sub-topics

Vision-Language Navigation

12 papers

Methods enabling agents to follow natural language instructions to navigate continuous or discrete environments, including topological planning, affordance-based path selection, and map-guided prompting.

Evolving Topological Planning, Visual Affordances Prompting, Map-Guided Prompting, Dual-System Architecture

GUI and Device Control Agents

9 papers

Autonomous agents that interact with graphical user interfaces on smartphones and desktops via vision-based understanding of screenshots, using VLMs for planning and specialized tools for precise element localization.

Offline-to-Online RL for Device Control, Vision-Centric Mobile Agent, Multi-Agent Collaboration, Bi-Level Expert Assimilation

Vision-Based Agile Locomotion and Flight

6 papers

End-to-end policies mapping raw visual inputs directly to motor commands for agile quadrotor flight, quadruped parkour, and legged robot soccer, typically using model-based RL or privileged distillation.

World-Model RL for Flight, Constrained RL with Privileged Warm-Start, NeRF-Augmented Sim-to-Real

Robot Learning and Manipulation

8 papers

Approaches for learning generalizable manipulation skills through self-improvement, simulation data generation, tool-use transfer from human videos, and few-shot augmentation for dexterous tasks.

Self-Improving Foundation Models, LLM-Driven Simulation Generation, Tool-as-Interface Transfer

3D Scene Understanding and Spatial Reasoning

12 papers

Building semantic 3D representations for embodied agents, including open-vocabulary scene graphs, reasoning segmentation, multi-frame spatial reasoning, and language-driven 3D scene generation.

Object-Centric 3D Scene Graphs, Embedding-as-Mask Segmentation, Multi-Frame Spatial Data Engines

Embodied Reasoning, Safety, and Evaluation

14 papers

Benchmarks and methods evaluating embodied agents on physical reasoning, safety-critical decision-making, long-horizon scene prediction, and multi-modal comprehension in diverse environments.

Physical AI Critic Models, Cost-Aware Embodied Search, Consequence-Driven Safety Alignment

Specialized Robotic Systems and Sensors

12 papers

Domain-specific robotic platforms and sensing technologies including surgical tactile sensors, egocentric AR data platforms, medical navigation, agricultural localization, and space perception.

Vision-Based Tactile Sensing, Egocentric Multi-Modal Platforms, Domain-Specific SLAM

💡 Key Insights

💡 Self-generated rewards enable robots to surpass imitation learning with 80% less human data.

💡 Decoupling slow reasoning from fast control achieves real-time 30Hz embodied navigation.

💡 Frontier MLLMs exhibit causal blindness, failing to foresee physical consequences in 30-92% of cases.

💡 Pure RL-trained reasoning segmentation outperforms supervised approaches with 100x less data.

💡 World-model imagination enables zero-shot sim-to-real transfer for agile robotic control.


📅 Timeline

Research has progressed from building foundational perception tools (2023) through scaling autonomous RL-based agents (2024), to self-improving models and dual-system architectures (2025), and now focuses on safety-critical evaluation and consequence-aware alignment for real-world deployment (2026).

2023-04 to 2023-11 Foundational perception, representation, and early VLM-based agents

(ETPNav, 2023) introduced online topological graph construction with ghost-node prediction, winning the CVPR 2022 RxR-Habitat Challenge
(Project Aria, 2023) released a comprehensive egocentric multi-sensor hardware platform with Machine Perception Services for AR research
(LISA, 2023) pioneered the embedding-as-mask paradigm, enabling segmentation from complex implicit queries via LLM reasoning
(ConceptGraphs, 2023) replaced dense feature clouds with structured object-centric 3D graphs for open-vocabulary planning
(GPT-4V, 2023) demonstrated zero-shot GUI navigation using GPT-4V with Set-of-Mark visual grounding

2024-01 to 2024-10 Scaling embodied agents through RL, multi-modal benchmarks, and agile control

(Mobile-Agent, 2024) and its successor (Mobile-Agent-v2, 2024) established vision-centric autonomous mobile device agents with multi-agent collaboration
(DigiRL, 2024) scaled offline-to-online RL for GUI control, achieving 67.2% success on Android-in-the-Wild with a 1.3B model outperforming 18B CogAgent
(GOAT-Bench, 2024) introduced multi-modal lifelong navigation with sequential subtasks testing persistent memory
(SoloParkour, 2024) and the vision-based robot soccer work (Learning Robot Soccer from Egocentric Vision, 2024) achieved agile real-world locomotion from raw visual inputs
(GenSim2, 2024) leveraged reasoning LLMs and multi-modal feedback for scalable simulation task generation, improving real-world success by +21.2%

🔀 Shift from static supervised training to autonomous online RL for embodied agents, demonstrated by DigiRL's +49.5-point absolute gain in success rate (17.7% to 67.2%) over supervised fine-tuning on real-world device control.
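
Gains in this section are reported both as absolute percentage points and as relative percentages, and the two can differ widely. The short calculation below uses the DigiRL figures given elsewhere in this section (17.7% for supervised fine-tuning, 67.2% for DigiRL); the helper names are illustrative only.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in percentage points."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Improvement as a percentage of the baseline."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    print(round(absolute_gain(17.7, 67.2), 1))   # 49.5 points
    print(round(relative_gain(17.7, 67.2), 1))   # 279.7 percent
```

The same 49.5-point gap is therefore close to a 280% relative improvement, so it is worth checking which convention a reported "+X%" follows.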

2025-01 to 2025-12 Self-improvement, dual-system architectures, and multi-frame spatial reasoning

(Self-Improving, 2025) introduced steps-to-go prediction as a self-generated reward, boosting success from 45% to 75% with 10% autonomous practice
(Dream to Fly, 2025) achieved the first autonomous pixel-to-command drone flight using world-model RL without intermediate representations
DualVLN (Ground Slow, Move Fast, 2025) proposed the first asynchronous dual-system VLN model achieving real-time 30Hz continuous control
(Seg-Zero, 2025) demonstrated emergent reasoning segmentation via pure RL (GRPO), surpassing supervised LISA by 18% zero-shot
(Multi-SpatialMLLM, 2025) equipped MLLMs with robust multi-frame spatial understanding, outperforming GPT-4o by 27 points

🔀 Emergence of self-improving foundation models that learn autonomously beyond imitation, reducing dependence on expensive human demonstrations by up to 80%.
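
One way to turn a steps-to-go prediction into a self-generated reward is to pay the agent for each reduction in the predicted number of remaining steps. The sketch below assumes that shaping, with the predictions passed in as a plain list; the function name and the exact reward form are assumptions for illustration rather than a reproduction of the cited method.

```python
from typing import List


def steps_to_go_rewards(predicted_steps: List[float]) -> List[float]:
    """Reward each transition by how much the predicted steps-to-go dropped.

    predicted_steps[t] is the model's own estimate of the steps remaining
    at time t, so no human-written reward function is needed.
    """
    return [prev - nxt for prev, nxt in zip(predicted_steps, predicted_steps[1:])]


if __name__ == "__main__":
    # The estimate falls, briefly rises after a mistake, then falls again.
    print(steps_to_go_rewards([10, 8, 9, 5]))   # [2, -1, 4]
```

The rewards telescope: their sum equals the first estimate minus the last, so an episode is rewarded overall only if it ends closer to the goal than it started.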

2026-01 to 2026-03 Safety-critical evaluation, consequence-aware alignment, and domain-specific deployment

(PhyCritic, 2026) introduced self-referential critic fine-tuning for physical AI, requiring models to solve problems before evaluating others' answers
(Bi-level Expert-to-Policy Assimilation, 2026) achieved +40.5% relative improvement on OSWorld-Verified by converting expert traces into reachable student trajectories
(LabShield, 2026) revealed a 32% performance drop when frontier MLLMs move from text-based MCQs to visual laboratory hazard scenarios
(OOD-MMSafe, 2026) introduced consequence-driven safety alignment (CASPO), reducing risk identification failure from 51% to 5.7%
(MANSION, 2026) generated 1,000+ multi-floor buildings, exposing sharp performance degradation of SOTA agents on vertical navigation tasks

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
End-to-End World-Model RL for Agile Control	Train a world model in latent space from pixels and learn the policy by 'dreaming' inside it, bypassing sample-inefficient real-world interactions.	Achieves 100% gate traversal success in simulation where model-free PPO completely fails (0%), and deploys zero-shot to real drones at 1.5 m/s (Dream to Fly). SoloParkour clears obstacles 1.5x the robot's height, matching privileged teacher performance.	Dream to Fly (2025), SoloParkour (2024), Bootstrapping Reinforcement Learning with Imitation... (2024), Learning Robot Soccer from Egocentric... (2024)
Self-Improving Robotic Foundation Models	Use the model's own predictions (e.g., steps-to-go estimates or VLM-based evaluators) as reward signals for autonomous RL-based self-improvement.	Self-Improving Foundation Models boost real-world success from 45% to 75% with just 10% additional autonomous practice, outperforming 8x more human demonstration data (60%). DigiRL achieves 67.2% on Android-in-the-Wild, a +49.5% absolute improvement over supervised fine-tuning (17.7%).	Self-Improving (2025), DigiRL (2024), From Off-Policy to On-Policy: Enhancing... (2026)
Dual-System Vision-Language Navigation	Separate 'thinking slowly' (VLM-based global planning) from 'moving fast' (diffusion or heuristic local control), connected via latent queries or topological graphs.	DualVLN achieves 0.03s inference latency (30Hz real-time control) versus 0.7s+ for monolithic VLM approaches. ETPNav improves +25.99% Success Rate over RecBERT on RxR-CE and won the CVPR 2022 RxR-Habitat Challenge, doubling the second-best model's score.	Ground Slow, Move Fast: A... (2025), ETPNav (2023), MapGPT (2024)
Reasoning Segmentation via LLM-Grounded Perception	Introduce a special segmentation token in the LLM whose hidden embedding directly prompts a mask decoder, unifying language reasoning and pixel-level perception.	LISA-13B achieves 63.2 gIoU on ReasonSeg, outperforming the specialist model SEEM (25.6 gIoU) by +37.6 points. Seg-Zero achieves 57.5 zero-shot on ReasonSeg, surpassing prior LISA-7B by 18% using pure RL without supervised reasoning traces.	LISA (2023), Seg-Zero (2025), Active-o3 (2025)
Object-Centric 3D Scene Graphs for Embodied Planning	Replace dense per-point feature maps with graph-structured object nodes enriched by VLM captions and LLM-reasoned spatial relationships for scalable embodied planning.	ConceptGraphs improves +16.47 mAcc over ConceptFusion on open-vocabulary 3D segmentation and achieves 0.80 Recall@1 on complex negation queries versus 0.26 for CLIP-based retrieval. MANSION generates 1,000+ multi-floor buildings where SOTA agents show sharp performance degradation.	ConceptGraphs (2023), Scenethesis (2025), MANSION (2026)
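
The object-centric scene graph in the last row can be illustrated with a small data structure: each object is a node with a caption and a position, and relations are stored as edges. In the sketch below the query step uses plain word overlap as a stand-in for LLM reasoning, which is a deliberate simplification; the class and method names are hypothetical and do not reflect the ConceptGraphs codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    node_id: int
    caption: str                          # language description of the object
    position: Tuple[float, float, float]  # 3D centroid in the map frame


@dataclass
class SceneGraph:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def relate(self, subject: int, relation: str, obj: int) -> None:
        self.edges.append((subject, relation, obj))

    def relations_of(self, node_id: int) -> List[Tuple[int, str, int]]:
        return [e for e in self.edges if node_id in (e[0], e[2])]

    def query(self, text: str) -> Optional[ObjectNode]:
        """Return the node whose caption shares the most words with the query."""
        words = set(text.lower().split())
        best, best_score = None, 0
        for node in self.nodes.values():
            score = len(words & set(node.caption.lower().split()))
            if score > best_score:
                best, best_score = node, score
        return best


if __name__ == "__main__":
    graph = SceneGraph()
    graph.add(ObjectNode(1, "white medicine bottle", (1.0, 2.0, 0.9)))
    graph.add(ObjectNode(2, "bathroom sink", (1.0, 2.1, 0.8)))
    graph.relate(1, "on", 2)
    print(graph.query("find the medicine bottle"))
```

The query touches one compact record per object rather than a dense per-point feature map, which is the scalability argument made for graph-structured maps.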

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
Android-in-the-Wild (AitW)	Task Success Rate	67.2%	DigiRL (2024)
ReasonSeg	gIoU (average of per-image IoUs)	63.2 gIoU	LISA (2023)
RxR-CE (Room-across-Room Continuous Environment)	Success Rate (SR)	+25.99% SR improvement over RecBERT baseline	ETPNav (2023)
OSWorld-Verified	Task Success Rate	32.13%	From Off-Policy to On-Policy: Enhancing... (2026)
LanguageTable (Real-World Robotic Manipulation)	Task Success Rate	~87-88%	Self-Improving (2025)
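
For ReasonSeg, the LISA evaluation protocol defines gIoU as the average of per-image IoUs and cIoU as the cumulative intersection over the cumulative union. The sketch below computes both from per-image pixel counts. The input format and the convention of scoring an empty union as 1.0 are assumptions made for this example.

```python
from typing import List, Tuple


def giou_ciou(counts: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Compute (gIoU, cIoU) from per-image (intersection, union) pixel counts.

    Assumes at least one image has a non-empty union.
    """
    # gIoU: every image contributes equally, whatever the object size.
    per_image = [i / u if u > 0 else 1.0 for i, u in counts]
    giou = sum(per_image) / len(per_image)
    # cIoU: pixels are pooled first, so large objects dominate.
    ciou = sum(i for i, _ in counts) / sum(u for _, u in counts)
    return giou, ciou


if __name__ == "__main__":
    # A small object segmented poorly and a large object segmented well.
    print(giou_ciou([(50, 100), (900, 1000)]))
```

In this example gIoU is 0.7 while cIoU is about 0.86, showing how cIoU favors large objects.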

⚠️ Known Limitations (4)

Sim-to-real transfer gap: Policies trained in simulation often degrade significantly when deployed in real-world settings due to visual, dynamic, and physical mismatches that domain randomization alone cannot fully address. (affects: End-to-End World-Model RL for Agile Control, Self-Improving Robotic Foundation Models)
Potential fix: NeRF-based rendering for photorealistic simulation backgrounds (Robot Soccer), domain randomization combined with privileged warm-starting (SoloParkour), and geometry-focused representations like point clouds that are more transfer-friendly (GenSim2).
Safety in physical environments: Current embodied agents lack the ability to anticipate hazardous physical consequences of their actions, which is critical for deployment in laboratories, surgical settings, and household environments. (affects: Self-Improving Robotic Foundation Models, Dual-System Vision-Language Navigation)
Potential fix: Consequence-Aware Safety Policy Optimization (CASPO) shifts alignment from intent detection to causal projection, reducing risk identification failure from 51% to 5.7%. LabShield proposes multi-view visual data with OSHA-standard safety taxonomies.
Scalability of demonstration data: High-quality robotic demonstration data is expensive and time-consuming to collect, limiting the diversity and generalization of learned policies. (affects: Self-Improving Robotic Foundation Models, Object-Centric 3D Scene Graphs for Embodied Planning)
Potential fix: Self-improvement loops using steps-to-go rewards reduce data needs by 80% (Self-Improving FM). GenSim2 automates task generation via reasoning LLMs and visual feedback. Tool-as-Interface reduces collection time by 77% by learning from human videos instead of teleoperation.
Long-horizon reasoning under partial observability: Agents struggle with multi-step tasks in partially observable environments, especially in multi-floor buildings or cluttered rooms where key information is occluded or distant. (affects: Dual-System Vision-Language Navigation, Object-Centric 3D Scene Graphs for Embodied Planning)
Potential fix: Hierarchical chain-of-thought prompting for segment-level decomposition (PM-Nav), persistent memory through topological maps (GOAT-Bench), and cost-aware search strategies that prioritize cognitive retrieval over physical exploration (ESearch-R1).


💡 Diving deeper into Embodied AI and Robotics, let's examine specific research threads that define this area.

✍️

Robotic Manipulation and Control

What: Research on enabling robots to perceive, reason about, and physically manipulate objects using vision-language-action models that unify perception, language understanding, and motor control.

Why: Autonomous manipulation is essential for deploying robots in homes, factories, and unstructured environments where tasks require dexterous, adaptive physical interaction.

Baseline: Supervised fine-tuning of vision-language models on expert demonstrations to directly predict robot actions from camera images and language instructions.

Distribution shift causes compounding errors when robots encounter states unseen during demonstration training
Balancing high-level semantic reasoning with low-latency motor control for real-time manipulation
Sparse reward signals make it difficult to learn precise, long-horizon manipulation behaviors from trial-and-error

🧪 Running Example

❓ Pick up the red mug from a cluttered kitchen counter and place it in the dishwasher rack

Baseline: A standard imitation learning policy maps the camera image directly to motor commands. It fails when the mug is positioned differently from training data, when nearby objects cause visual confusion, or when the mug slips during grasping — the policy cannot recover from errors not present in its training distribution.

Challenge: This task illustrates three key challenges: (1) distribution shift — the mug's exact position and surrounding clutter vary each time, (2) long-horizon execution — the robot must reach, grasp, transport, and precisely place the mug without dropping it, and (3) latency — the grasp phase demands fast reactive control while planning demands slow deliberation.

✅ Flow-Matching Action Generation: Generates continuous, multi-modal grasp trajectories that capture the full distribution of valid grasp angles, enabling precise grasping regardless of mug orientation

✅ RL Post-Training for VLAs: Through online trial-and-error, the robot learns recovery behaviors — like re-grasping after a slip — that were never demonstrated by human operators

✅ Embodied Chain-of-Thought Reasoning: The model reasons step-by-step: 'Locate red mug behind the bowl → plan approach from the left to avoid the plate → grasp handle → lift and transport to dishwasher', improving success on novel arrangements

✅ Dual-System Hierarchical Control: A slow VLM plans the overall sequence while a fast lightweight policy tracks the grasp at 100+ Hz, ensuring stable contact during the delicate pick-up phase

✅ Dense Process Reward Modeling: Provides step-level progress signals (e.g., 'approaching mug: 40% → grasped: 70% → transported: 90%') instead of a single success/fail, enabling efficient RL for each subtask stage

📈 Overall Progress

The field has progressed from simple imitation learning policies to sophisticated VLA architectures that integrate perception, reasoning, and action in unified frameworks. A major paradigm shift occurred with the adoption of reinforcement learning post-training, which broke the 'imitation ceiling' and enabled policies to achieve near-perfect success rates through self-improvement. Simultaneously, dual-system architectures and inference acceleration have made real-time deployment practical, with systems now operating continuously for hours in unstructured public environments.

📂 Sub-topics

VLA Architecture and Foundation Models

14 papers

Core architectural innovations for vision-language-action models, including backbone selection, action representation (discrete tokens vs. continuous flow matching), policy head design, and unified training recipes for generalist robot control.

Flow-Matching VLA, Componentized Cognition-Action, Parallel Decoding, Unified Spatial-Temporal Training

Reinforcement Learning for VLA Post-Training

13 papers

Methods that use reinforcement learning — including PPO, GRPO, and offline RL — to fine-tune pre-trained VLA models beyond supervised imitation learning, addressing distribution shift, reward sparsity, and training instability.

Trajectory-Level PPO, GRPO Adaptation, Residual RL, Stage-Aware Reinforcement

Reasoning-Enhanced Robotic Control

12 papers

Approaches that augment VLA models with explicit chain-of-thought reasoning, visual planning, and structured decision-making to improve generalization, interpretability, and long-horizon task execution.

Embodied Chain-of-Thought, Visual Chain-of-Thought, Action Reasoning Models, Adaptive Reasoning

Vision-Language-Action Models for Autonomous Driving

7 papers

Adaptation of VLA architectures to end-to-end autonomous driving, addressing challenges of physically feasible trajectory generation, adaptive reasoning under varying scenario complexity, and causal understanding for safety-critical decisions.

Reward World Model, Adaptive Fast-Slow Thinking, Chain of Causation, Difficulty-Biased RL

Scalable Training and Efficient Deployment

10 papers

Infrastructure, data generation pipelines, sim-to-real transfer, and inference acceleration techniques that make VLA models practical for real-world deployment, including distillation, asynchronous distributed training, and autonomous data collection.

IMLE Distillation, LLM-Guided Data Generation, Asynchronous Training Pipeline, Entangled Action Pairs

Robotic Hardware and Multi-Modal Sensing

5 papers

Innovations in physical robot design including tactile sensors, compliant grippers, and dexterous hands, as well as multi-modal perception systems that integrate vision, touch, and proprioception for contact-rich manipulation.

Miniaturized Tactile Sensing, Hybrid Hard-Soft Compliance, Compositional Diffusion Policy

💡 Key Insights

💡 RL post-training breaks the imitation ceiling, enabling 99-100% manipulation success rates

💡 Dual-system architectures achieve 100× control speedup while preserving VLM reasoning

💡 Explicit reasoning traces improve VLA generalization by 28% without additional robot data

💡 Dense process reward models enable sample-efficient real-world RL from near-zero performance


📅 Timeline

Research evolved from discrete action tokenization and pure supervised learning (2023-2024) to continuous flow-matching generation and RL-enhanced self-improvement (2025), and is now converging on hierarchical fast-slow architectures with explicit 3D spatial reasoning and atomic skill decomposition for scalable, deployable robotic intelligence (2026).

2023-03 to 2024-06 Early foundations in embodied reasoning, tactile sensing, LLM-guided data generation, and initial VLA formulations

(SpiRobs, 2023) introduced bio-inspired soft manipulators with logarithmic spiral morphology, grasping objects varying by two orders of magnitude in size
(Minsight, 2023) demonstrated a fingertip-sized vision-based tactile sensor achieving 60 Hz sensing with 0.07 N force accuracy
(EmbodiedGPT, 2023) pioneered Chain-of-Thought pre-training for embodied agents using the EgoCOT dataset of 2M+ annotated video clips, outperforming BLIP-2 by 22.1%
Language-Guided Skill Acquisition (Scaling Up and Distilling Down, 2023) showed LLMs can generate diverse training data with auto-verification and distill it into robust multi-task diffusion policies (+33.2% over the LLM collector)
The VLA survey (A Survey on Vision-Language-Action Models, 2024) established a hierarchical taxonomy organizing over 50 models into distinct architectural families
(LLaRA, 2024) introduced visuomotor instruction tuning that converts robot data into text-based conversations with self-supervised auxiliary tasks

2024-07 to 2025-02 VLA architecture explosion with flow matching, embodied reasoning, and systematic design principles

ECoT (Robotic Control via Embodied Chain-of-Thought Reasoning, 2024) demonstrated that explicit reasoning traces improve VLA generalization by 28%, outperforming the 55B RT-2-X with only a 7B model
Maniwhere (Learning to Manipulate Anywhere, 2024) achieved zero-shot sim-to-real transfer across 3 hardware setups using multi-view representation learning with spatial transformers
(π₀, 2024) introduced the flow-matching VLA paradigm, training on 10,000 hours of data across 7 robot configurations for up to 50 Hz control
(CogACT, 2024) showed that separating cognition (VLM) from action generation (Diffusion Transformer) surpasses OpenVLA by 55% in real-world success
RoboVLMs (What Matters in Building VLAs, 2024) established systematic design principles, finding that Policy Head formulation and post-training recipes are critical
(Optimized Fine-Tuning, 2025) identified the optimal recipe of parallel decoding with L1 regression, achieving 97.1% on LIBERO and 26× throughput gain over autoregressive methods
(Magma, 2025) unified spatial-temporal training across 2D and 3D domains, achieving SOTA on both UI navigation and robotic manipulation

🔀 The field shifted from end-to-end imitation learning to structured VLA architectures that separate cognition from action, with π₀ establishing flow matching as the dominant action generation paradigm.
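
As a rough illustration of flow-matching action generation, the sketch below assumes a straight-line probability path between a noise sample and a target action. It is not π₀'s implementation: `toy_velocity_field` is a hypothetical stand-in for the learned network, and it is exact only for a single known target action. Sampling integrates the velocity field from t = 0 to t = 1 with Euler steps.

```python
import random

def straight_line_target(noise, action, t):
    """Training-time quantities for a straight-line path from noise to action."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]  # interpolated sample
    velocity = [a - n for n, a in zip(noise, action)]             # regression target
    return x_t, velocity

def toy_velocity_field(x, t, target_action):
    """Hypothetical stand-in for a learned network (exact for one target action)."""
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target_action)]

def sample_action(target_action, num_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise toward an action."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target_action]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v = toy_velocity_field(x, t, target_action)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

action = sample_action([0.2, -0.1, 0.5])
```

In a real VLA the velocity field would be a network conditioned on images, language, and robot state, and the number of integration steps trades accuracy against control frequency.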

2025-03 to 2025-12 RL revolution for VLA post-training, dual-system maturation, and near-perfect real-world manipulation

(VLA-RL, 2025) formulated robot manipulation as multi-turn RL conversations with a Robotic Process Reward Model, matching commercial π₀-FAST performance
(Fast-in-Slow, 2025) repurposed VLM final layers as a fast execution module, achieving 117.7 Hz control and +11% over OpenVLA in real-world tasks
(OneTwoVLA, 2025) unified System 1/2 in a single model with autonomous mode switching, achieving +30% over flat VLA baselines on long-horizon tasks
(SimpleVLA-RL, 2025) demonstrated that GRPO with dynamic sampling achieves 91.7% from a single demonstration, outperforming π₀
(Self-Improving, 2025) achieved 99% simulation and 100% real-world success by training lightweight residual RL agents that correct VLA failures
RL-100 (RL-100, 2025) achieved 100% success across 1000 real-world evaluations and 7-hour continuous operation in a public shopping mall with zero failures
(Robo-Dopamine, 2025) trained a General Reward Model on 3,400+ hours of data enabling one-shot policy adaptation from near-zero to 95% success
Alpamayo-R1 (Alpamayo-R1, 2025) introduced Chain of Causation reasoning for driving, achieving 35% reduction in close encounters and 45% improvement in reasoning quality

🔀 The field shifted from pure imitation learning to RL-enhanced VLAs, with multiple methods achieving 95-100% success rates and demonstrating hours-long real-world operation without human intervention.
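
The GRPO-style post-training mentioned above can be sketched as follows. This is a minimal illustration, not the SimpleVLA-RL code: rewards are assumed to be binary task-success signals for a group of rollouts of the same task, and "dynamic sampling" is assumed to mean skipping groups whose rollouts all succeed or all fail, since such groups give zero advantage. The function names are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (no value network)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Assumed dynamic-sampling filter: keep only groups with mixed outcomes."""
    return len(set(rewards)) > 1

group = [1.0, 0.0, 0.0, 1.0]          # two successes, two failures
if keep_group(group):
    advantages = group_relative_advantages(group)
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update pushes toward the behaviors that solved the task within that group.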

2026-01 to 2026-03 Maturation through 3D spatial reasoning, atomic skill decomposition, and deployment-ready hardware

(VLA-Thinker, 2026) introduced thinking-with-image reasoning where perception is a dynamically invocable action, achieving 97.5% on LIBERO and tripling long-horizon success
(GST-VLA, 2026) replaced 2D patches with 3D Gaussian spatial tokens encoding surface geometry and orientation, achieving 96.4% on LIBERO (+2.0% over the prior SOTA)
(AtomicVLA, 2026) decomposed tasks into atomic skills with Mixture-of-Experts routing, enabling continual learning of new skills without forgetting (+21% in real-world tasks)
(CRAFT, 2026) introduced hybrid hard-soft compliance achieving 100% success on fragile tasks and full coverage of all 33 Feix grasp taxonomy types
IMLE Distillation (From Flow to One Step, 2026) achieved 123.5 Hz single-step inference (14.3× speedup) via set-level distillation, enabling dynamic re-planning where slow teachers fail
(Thousand-GPU, 2026) reduced training time from 15 hours to 22 minutes (40× speedup) using the asynchronous RL-VLA3 architecture

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Flow-Matching and Diffusion-Based Action Generation	Continuous flow matching integrates directly into VLM backbones, generating precise multi-modal action distributions without discretization artifacts.	Surpasses OpenVLA discrete tokenization by +55% success rate in real-world experiments (CogACT), and achieves 97.1% on LIBERO vs. 76.5% for standard OpenVLA (OFT).	π₀: A Vision-Language-Action Flow Model... (2024), CogACT (2024), HybridVLA (2025), Fine-Tuning Vision-Language-Action Models (2025)
Reinforcement Learning Post-Training for VLAs	Online RL enables VLA models to discover recovery behaviors and novel strategies never shown in human demonstrations, breaking the imitation ceiling.	PLD achieves 99% success on LIBERO vs. SFT baselines failing on recovery tasks; SimpleVLA-RL achieves 91.7% on LIBERO-Long with one demo vs. 17.1% for SFT (+74.6%).	Self-Improving (2025), SimpleVLA-RL (2025), RL-100 (2025), StARe-VLA (2025)
Embodied Chain-of-Thought Reasoning	Interleaving semantic reasoning with spatial grounding forces the model to 'look before acting', improving generalization without additional robot data.	ECoT improves OpenVLA by +28% absolute success rate on generalization tasks; VLA-Thinker achieves 97.5% on LIBERO vs. 91.0% for OpenVLA-OFT (+6.5%).	Robotic Control via Embodied Chain-of-Thought... (2024), Fast ECoT (2025), VLA-Thinker (2026), MolmoAct (2025)
Dual-System Hierarchical Control	Inspired by Kahneman's System 1/2 theory, cached semantic features from a slow VLM enable a fast policy to act at 100+ Hz without re-querying the large model.	Fast-in-Slow achieves 117.7 Hz control and outperforms OpenVLA by +11% in real-world tasks; HAMSTER improves over OpenVLA by 20% across seven generalization axes.	Fast-in-Slow (2025), HAMSTER (2025), OneTwoVLA (2025), SaiVLA-0 (2026)
Dense Process Reward Modeling	A general reward model trained on multi-view data predicts relative progress between states, providing policy-invariant dense rewards without altering the optimal policy.	Robo-Dopamine improves success from near-zero to 95% with only 150 rollouts (~1 hour); SARM achieves 83% on real-world T-shirt folding vs. 8% for vanilla Behavior Cloning.	Robo-Dopamine (2025), SARM (2025), VLA-RL (2025)
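
One standard way to add dense rewards without changing the optimal policy is potential-based shaping, where the shaping term is the discounted change in a progress estimate. The sketch below assumes that form. Whether the cited reward models use exactly this construction is an assumption, and `progress` is a hypothetical list of per-state progress estimates.

```python
def shaped_rewards(sparse_rewards, progress, gamma=0.99):
    """Add gamma * Phi(s') - Phi(s) to each step; progress has one more entry than rewards."""
    return [r + gamma * progress[t + 1] - progress[t]
            for t, r in enumerate(sparse_rewards)]

def discounted_return(rewards, gamma=0.99):
    """Plain discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

sparse = [0.0, 0.0, 0.0, 1.0]             # success signalled only at the end
progress = [0.0, 0.2, 0.5, 0.9, 1.0]      # hypothetical progress estimates Phi(s_t)
dense = shaped_rewards(sparse, progress, gamma=0.9)
```

The shaping terms telescope, so the shaped and sparse returns differ only by `gamma**T * progress[-1] - progress[0]`, a quantity that does not depend on the actions taken in between. That is why the ranking of policies is preserved.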

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
LIBERO	Success Rate	99.0%	Self-Improving (2025)
SimplerEnv	Success Rate	98.0%	StARe-VLA (2025)
Real-World Multi-Task Manipulation	Success Rate	100.0%	RL-100 (2025)
CALVIN	Average Successful Task Length	4.25 tasks	What Matters in Building Vision-Language-Action... (2024)

⚠️ Known Limitations (4)

Simulation-to-real transfer gap: Most results are demonstrated in simulation, and policies often degrade significantly when deployed on physical hardware due to visual, dynamic, and kinematic differences. (affects: Flow-Matching Action Generation, Embodied Chain-of-Thought Reasoning, RL Post-Training for VLAs)
Potential fix: Curriculum-based domain randomization (Maniwhere) and robustness-aware regularization (RobustVLA) that penalizes sensitivity to visual and execution perturbations
Inference latency vs. reasoning depth trade-off: Large VLA models with chain-of-thought reasoning generate outputs too slowly for real-time control, with standard ECoT requiring ~5.5 seconds per step. (affects: Embodied Chain-of-Thought Reasoning, Dual-System Hierarchical Control)
Potential fix: Temporal caching and asynchronous reasoning (Fast ECoT achieves 7.5x speedup), dual-system architectures (Fast-in-Slow at 117.7 Hz), and IMLE distillation (123.5 Hz single-step inference)
Data scarcity and embodiment diversity: High-quality robot demonstration data is expensive to collect, and policies trained on one robot body often fail to transfer to different morphologies or end-effectors. (affects: Flow-Matching Action Generation, RL Post-Training for VLAs)
Potential fix: LLM-guided autonomous data generation (Scaling Up and Distilling Down), learning from off-domain data like human videos (ZeroWBC, HAMSTER), and self-resetting collection loops (RoboClaw)
Safety and robustness in unstructured environments: VLA models lack formal safety guarantees and can fail unpredictably when encountering visual clutter, adversarial objects, or out-of-distribution scenarios. (affects: Flow-Matching Action Generation, Embodied Chain-of-Thought Reasoning, Dense Process Reward Modeling)
Potential fix: Subtractive visual distillation that removes clutter from inputs (CGVD improves by +34.5%), Jacobian regularization for input sensitivity (RobustVLA), and closed-loop verification with error recovery (Agentic Robot)
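
The dual-system remedy for the latency limitation can be sketched as a control loop in which an expensive module refreshes a cached feature only every few ticks while a cheap policy acts on every tick. Everything here is a hypothetical stand-in: `slow_vlm` and `fast_policy` are placeholders, not the Fast-in-Slow or HAMSTER architectures.

```python
def slow_vlm(observation, instruction):
    """Hypothetical stand-in for an expensive VLM call that returns a cached feature."""
    return {"goal": instruction, "anchor": observation}

def fast_policy(observation, cached):
    """Hypothetical lightweight controller that steers toward the cached anchor plus one."""
    return (cached["anchor"] + 1.0) - observation

def run_dual_system(observations, instruction, slow_every=10):
    """Act on every tick, but re-query the slow module only every `slow_every` ticks."""
    cached, actions, slow_calls = None, [], 0
    for step, obs in enumerate(observations):
        if step % slow_every == 0:
            cached = slow_vlm(obs, instruction)
            slow_calls += 1
        actions.append(fast_policy(obs, cached))
    return actions, slow_calls

actions, slow_calls = run_dual_system([float(i) for i in range(25)], "pick up the cup")
```

With 25 ticks and `slow_every=10`, the slow module runs at ticks 0, 10, and 20, so the control rate is set by the fast policy rather than by the large model.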

📚 View major papers in this topic (10)

Robotic Control via Embodied Chain-of-Thought Reasoning (2024-07) 9
π₀: A Vision-Language-Action Flow Model for General Robot Control (2024-10) 8
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL (2025-10) 9
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning (2025-09) 9
RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning (2025-10) 9
Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation (2025-12) 9
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (2025-02) 9
Magma: A Foundation Model for Multimodal AI Agents (2025-02) 9
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail (2025-10) 9
VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning (2026-03) 8

💡 Within the same paradigm, another important research direction focuses on Autonomous Driving.

🔗

Autonomous Driving

What: Research on enabling vehicles to perceive, reason about, and navigate complex traffic environments autonomously using multi-modal sensors and learned decision-making models.

Why: Safe and reliable self-driving requires bridging perception, reasoning, and planning in dynamic, unpredictable environments with diverse road users and rare edge cases.

Baseline: Traditional modular pipelines with separate perception, prediction, and rule-based planning components connected through hand-crafted interfaces and HD maps.

Long-tail scenarios with rare events lack sufficient training data, causing brittle failures in safety-critical situations
Bridging high-level semantic reasoning with physically feasible, temporally consistent trajectory generation remains difficult
Fusing heterogeneous sensor modalities while handling calibration errors, occlusions, and adverse weather conditions

🧪 Running Example

❓ A self-driving car approaches a construction zone where workers partially block the lane, an oncoming vehicle begins overtaking, and road markings are obscured.

Baseline: A traditional modular pipeline detects the static barriers via LiDAR and camera but fails to predict the workers' intent to cross, cannot reason about the oncoming overtake as a coordinated social interaction, and generates a jerky stop-and-go trajectory due to conflicting rule-based heuristics.

Challenge: This scenario is a long-tail event rarely seen in training data, requires understanding human intent and social negotiation, demands robust sensor fusion under unusual road conditions, and needs temporally smooth planning that respects vehicle dynamics.

✅ RL-Enhanced VLA Driving: Alpamayo-R1 would generate a Chain of Causation linking 'workers near lane' → 'must yield' → 'decelerate and shift laterally', producing a kinematically feasible avoidance trajectory refined via reinforcement learning rewards for safety and comfort.

✅ Chain-of-Thought Reasoning for Driving: PKRD-CoT would force the model through explicit steps — perceive construction zone, recall traffic rules for work zones, reason about worker movement, then decide to slow and merge — providing an interpretable decision trace for validation.

✅ Multi-Modal Sensor Fusion: MSeg3D would fuse LiDAR geometry with camera imagery to detect partially occluded workers and equipment even when some points fall outside the camera's field of view, using cross-modal feature completion to fill perception gaps.

✅ Driving World Models: Drive-OccWorld would imagine multiple future occupancy states conditioned on different ego-actions (brake, swerve, wait), evaluating each candidate trajectory against predicted collisions before committing to the safest path.
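
The world-model step in the running example can be sketched as follows: each candidate ego-action is paired with a predicted occupancy sequence, and the planner picks the candidate whose path crosses the least predicted occupancy. This is a toy under stated assumptions, not Drive-OccWorld: `toy_world_model` returns hand-written forecasts, and the grid cells and probabilities are made up for illustration.

```python
def collision_cost(predicted_occupancy, ego_cells):
    """Sum predicted occupancy probability over the cells the ego would occupy."""
    return sum(predicted_occupancy[t].get(cell, 0.0)
               for t, cell in enumerate(ego_cells))

def select_action(candidates, world_model):
    """Score every candidate in imagination and return the lowest-cost one."""
    costs = {name: collision_cost(world_model(name), cells)
             for name, cells in candidates.items()}
    return min(costs, key=costs.get), costs

def toy_world_model(action):
    """Hypothetical forecasts: per time step, a map from grid cell to occupancy probability."""
    forecasts = {
        "brake":  [{(0, 1): 0.05}, {(0, 1): 0.05}, {(0, 1): 0.05}],
        "swerve": [{(0, 1): 0.05}, {(1, 2): 0.60}, {(1, 3): 0.30}],
        "keep":   [{(0, 1): 0.05}, {(0, 2): 0.90}, {(0, 3): 0.90}],
    }
    return forecasts[action]

candidates = {
    "brake":  [(0, 1), (0, 1), (0, 1)],   # ego stays in place
    "swerve": [(0, 1), (1, 2), (1, 3)],   # ego moves into the oncoming lane
    "keep":   [(0, 1), (0, 2), (0, 3)],   # ego drives into the work zone
}
best, costs = select_action(candidates, toy_world_model)
```

A real planner would combine this collision term with comfort and progress terms. The sketch isolates only the "evaluate before committing" step.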

📈 Overall Progress

The field has undergone two major paradigm shifts: first, from modular pipelines to end-to-end learned systems (2023–2024), and then from pure imitation learning to reinforcement-learning-enhanced VLA models with structured reasoning (2025–2026). Multi-modal perception matured from basic LiDAR-camera concatenation to robust semantic fusion handling adverse conditions and missing modalities. Planning evolved from deterministic trajectory generation to probabilistic, momentum-stabilized approaches with world-model-based safety verification.

📂 Sub-topics

Vision-Language-Action Models for Driving

7 papers

End-to-end driving architectures that combine vision-language understanding with action generation, typically refined via reinforcement learning to produce physically feasible trajectories beyond imitation learning.

Reward World Model (IRL-VLA) Dr. GRPO (NoRD) Chain of Causation (Alpamayo-R1) Adaptive Fast/Slow Thinking (AdaThinkDrive)

End-to-End Planning and Trajectory Optimization

5 papers

Methods that replace modular planning pipelines with learned systems that directly score, generate, or refine trajectory candidates from sensor inputs, handling multi-modal driving behavior and temporal consistency.

Generalized Trajectory Scoring (GTRS) Probabilistic Planning (VADv2) Momentum-Aware Planning (MomAD)

Reasoning, Chain-of-Thought, and Interpretability

7 papers

Approaches that enhance autonomous driving with structured reasoning chains, retrieval-augmented learning, and human-feedback mechanisms to improve decision interpretability and generalization.

PKRD-CoT DriveCoT RAG-Driver Physiological-LLM-RLHF

Multi-Modal 3D Perception and Sensor Fusion

7 papers

LiDAR-camera fusion methods for 3D object detection, semantic segmentation, occupancy prediction, and map construction that handle modality heterogeneity, field-of-view mismatches, and adverse conditions.

MSeg3D RoboFusion LaserMix++ Co-Occ

World Models and Trajectory Prediction

5 papers

Internal predictive models that forecast future environment states or agent trajectories, enabling safer planning through imagination-based evaluation and handling variable-length or incomplete observations.

Drive-OccWorld Tokenized Intent World Model (TIWM) Kinematics-Aware Latent WM Progressive Retrospective Framework

💡 Key Insights

💡 Reinforcement learning transforms VLA models from passive imitators to adaptive driving agents

💡 Adaptive reasoning depth saves 14% inference time by bypassing chain-of-thought in simple scenarios

💡 Cross-modal feature completion enables robust perception even with complete camera failure

💡 Trajectory momentum and probabilistic vocabularies reduce the jittery failures of one-shot planning

💡 World models reduce real-world data needs by enabling policy training entirely in imagination

📅 Timeline

Research has rapidly converged on VLA architectures as the dominant paradigm, with key innovations in adaptive reasoning depth (fast vs. slow thinking), data-efficient RL training, and cognitive world models that prioritize task-relevant abstraction over pixel-level reconstruction.

2023-03 to 2023-12 Foundations of multi-modal perception and early LLM integration for driving

MSeg3D (MSeg3D, 2023) introduced semantic-based fusion and cross-modal feature completion, achieving 81.14 mIoU on nuScenes and remaining robust even when camera input is unavailable
(Multi-Modal, 2023) systematized LiDAR-camera fusion approaches into a unified taxonomy
(DriveMLM, 2023) was among the first to align LLM outputs with standardized vehicle control states, achieving 76.1 Driving Score on CARLA Town05 Long

🔀 LLMs were first bridged to vehicle control through standardized behavioral planning states, moving beyond pure language outputs.

2024-01 to 2024-12 Chain-of-thought reasoning, probabilistic planning, and robust multi-modal perception

VADv2 (VADv2, 2024) pioneered probabilistic planning with a 4,096-trajectory vocabulary, achieving SOTA closed-loop driving on CARLA
(RAG-Driver, 2024) introduced retrieval-augmented in-context learning for zero-shot driving generalization without fine-tuning
(RoboFusion, 2024) adapted the Segment Anything Model (SAM) for robust 3D detection under adverse weather, improving +6.51% mAP on corrupted benchmarks
TOKEN (Tokenize the World into Object-level Knowledge, 2024) addressed long-tail failures by tokenizing the world into structured object-level representations, reducing collision rates by up to 100% in specific scenarios
(PlanAgent, 2024) demonstrated the first closed-loop mid-to-mid MLLM planning agent, outperforming both rule-based and learning-based baselines on nuPlan
(PKRD-CoT, 2024) designed structured chain-of-thought prompting that improved driving decision accuracy by 22% over standard zero-shot approaches

2025-01 to 2026-03 VLA revolution with reinforcement learning, adaptive reasoning, cognitive world models, and data efficiency

(MomAD, 2025) introduced trajectory and perception momentum, reducing collision rate by 26% and improving trajectory consistency by 33% over SparseDrive
(Generalized Trajectory Scoring, 2025) won the NAVSIM v2 Challenge with 49.4 EPDMS using a super-dense 16k trajectory scorer combined with diffusion-based generation
(IRL-VLA, 2025) proposed a Reward World Model via Inverse RL, eliminating expensive sensor simulation for VLA training and securing 1st runner-up in the CVPR 2025 Grand Challenge
Alpamayo-R1 (Alpamayo-R1, 2025) introduced causally grounded reasoning that uses RL to align reasoning with action, reducing the close-encounter rate by 35%
(NoRD, 2026) proved that VLAs can drive competitively with 60% less data and zero reasoning annotations using difficulty-aware Dr. GRPO optimization
Two surveys (CoT, 2025; Reasoning in AD Survey, 2026) formalized the evolution from rule-driven to knowledge-driven autonomous driving paradigms

🔀 The field shifted from imitation-only training to RL-enhanced VLA models with adaptive reasoning depth, representing a move from data-driven to knowledge-driven autonomous driving.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Reinforcement-Learning-Enhanced VLA Driving	Uses RL reward signals from world models or physics constraints to refine VLA policies, with adaptive fast/slow reasoning for efficiency.	Improves on standard GRPO-based VLAs by +11.68% PDM score using Dr. GRPO on NAVSIM; Alpamayo-R1 achieves +12% planning accuracy and 35% reduction in close encounter rate over trajectory-only baselines.	Alpamayo-R1 (2025), IRL-VLA (2025), NoRD (2026), AdaThinkDrive (2025)
Chain-of-Thought Reasoning for Driving	Forces models through explicit cognitive stages (perceive, recall knowledge, reason, decide) that mimic human driving cognition for interpretability.	PKRD-CoT improves decision-making accuracy by +22% over standard zero-shot prompts in ablation studies; GPT-4 achieves 100% accuracy in mathematical reasoning tasks within the framework.	PKRD-CoT (2024), DriveCoT (2024), RAG-Driver (2024)
Multi-Modal Sensor Fusion for 3D Perception	Uses cross-modal semantic alignment and adaptive feature gating to combine complementary strengths of sparse LiDAR geometry with dense camera texture.	MSeg3D achieves 81.14 mIoU on nuScenes test, +1.18 over previous best 2D3DNet; RoboFusion improves +6.51% mAP on KITTI-C (corrupted) over TransFusion baseline.	MSeg3D (2023), RoboFusion (2024), Multi-Modal (2024)
End-to-End Trajectory Planning and Scoring	Discretizes continuous planning into large trajectory vocabularies and uses learned scoring or probabilistic sampling to select temporally consistent optimal paths.	GTRS achieves 49.4 EPDMS on NAVSIM v2 Challenge (winning entry), approaching privileged planner PDM-Closed; MomAD reduces collision rate by 26% and improves trajectory consistency by 33% over SparseDrive.	Generalized Trajectory Scoring for End-to-end... (2025), VADv2 (2024), Don't Shake the Wheel: Momentum-Aware... (2025)
Driving World Models	Predicts future environment states using action-conditioned generative models, allowing candidate trajectories to be evaluated in imagination before execution.	Drive-OccWorld improves occupancy forecasting by +9.5% mIoU and +5.1% VPQ over prior methods on nuScenes; Kinematics-Aware WM achieves +23.1% Mean Return over image-only world model baselines.	Driving in the Occupancy World:... (2024), Constructing the Umwelt (2025), Kinematics-Aware (2026)
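
The trajectory-vocabulary idea in the table above can be sketched as scoring a fixed set of candidate trajectories and turning the scores into a probability distribution. This is a minimal illustration, not the VADv2 or GTRS implementation: the three-entry vocabulary and the scores are hypothetical, whereas a real system would score thousands of trajectories with a learned network.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def plan(vocabulary, scores):
    """Return the highest-probability trajectory and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(vocabulary)), key=probs.__getitem__)
    return vocabulary[best], probs

vocabulary = [
    [(0.0, 1.0), (0.0, 2.0)],    # go straight
    [(0.0, 0.5), (0.0, 0.8)],    # slow down
    [(0.5, 1.0), (1.0, 2.0)],    # shift right
]
scores = [0.1, 2.0, -1.0]        # hypothetical outputs of a learned scorer
trajectory, probs = plan(vocabulary, scores)
```

Keeping the full distribution rather than only the top entry is what lets a planner sample or re-rank candidates with extra safety terms.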

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
NAVSIM v2 (Navhard)	EPDMS (Extended Predictive Driver Model Score)	49.4 EPDMS	Generalized Trajectory Scoring for End-to-end... (2025)
CARLA Town05 Long	Driving Score (DS)	76.1 DS	DriveMLM (2023)
nuScenes Test (3D Segmentation)	mIoU (mean Intersection over Union)	81.14 mIoU	MSeg3D (2023)
KITTI-C (Corrupted)	mAP (mean Average Precision)	+6.51% mAP over TransFusion baseline	RoboFusion (2024)

⚠️ Known Limitations (4)

Sim-to-real domain gap: Models trained in simulators (CARLA, NAVSIM) or with synthetic corruptions may not transfer reliably to real-world driving conditions with novel sensor noise and lighting. (affects: Reinforcement-Learning-Enhanced VLA Driving, Driving World Models, End-to-End Trajectory Planning and Scoring)
Potential fix: IRL-VLA proposes Reward World Models that bypass sensor simulation entirely; domain randomization and progressive real-world fine-tuning are emerging strategies.
Computational overhead of reasoning: Chain-of-thought and VLA reasoning add significant latency, which conflicts with the real-time requirements of autonomous driving at highway speeds. (affects: Chain-of-Thought Reasoning for Driving, Reinforcement-Learning-Enhanced VLA Driving)
Potential fix: AdaThinkDrive's adaptive fast/slow mechanism bypasses reasoning in 84% of simple scenarios; NoRD eliminates reasoning annotations entirely while maintaining competitive performance.
Long-tail data scarcity: Rare but safety-critical scenarios (construction zones, emergency vehicles, unusual pedestrian behavior) remain severely underrepresented in training datasets. (affects: Reinforcement-Learning-Enhanced VLA Driving, End-to-End Trajectory Planning and Scoring, Multi-Modal Sensor Fusion for 3D Perception)
Potential fix: TOKEN uses object-level tokenization to leverage LLM reasoning for long-tail generalization; Alpamayo-R1's causal reasoning enables systematic handling of novel scenarios through compositional understanding.
Benchmark-reality disconnect: Current benchmarks primarily evaluate in constrained settings and may not capture the full complexity of real-world social interactions and edge cases. (affects: Chain-of-Thought Reasoning for Driving, End-to-End Trajectory Planning and Scoring, Driving World Models)
Potential fix: Both surveys identify the need for benchmarks that test social-cognitive reasoning, adversarial interactions, and multi-agent negotiation beyond current structured evaluation protocols.

💡 Within the same paradigm, another important research direction focuses on World Models and Simulation.

⚙️

World Models and Simulation

What: World models learn to predict future environment states given actions, enabling embodied agents to simulate outcomes and plan without costly real-world trial-and-error.

Why: Embodied agents must anticipate consequences of actions to plan safely, especially in driving and manipulation where real-world mistakes are dangerous and irreversible.

Baseline: Imitation learning policies that directly map observations to actions without internal forward simulation or explicit dynamics reasoning.

Pixel-level future prediction is computationally expensive and often produces physically implausible long-horizon forecasts
Sim-to-real domain gaps cause world models trained in simulation to fail in real-world deployment
Learned latent representations frequently lack geometric and kinematic structure needed for safe planning

🧪 Running Example

❓ An autonomous vehicle must execute an unprotected left turn at a busy intersection with oncoming traffic and pedestrians.

Baseline: An imitation learning policy replays left-turn trajectories from training data but cannot adapt to the specific timing of oncoming cars or pedestrian positions, risking a dangerous merge.

Challenge: The vehicle must predict how oncoming cars will decelerate or maintain speed, whether a pedestrian will enter the crosswalk, and evaluate multiple trajectory options—requiring forward simulation of a dynamic multi-agent scene over several seconds.

✅ Action-Conditioned Occupancy Forecasting: Forecasts 3D occupancy grids for multiple candidate turning trajectories, selecting the path with lowest predicted collision probability given forecasted traffic flow.

✅ Inverse RL Reward World Models: A lightweight neural reward model scores each candidate trajectory for safety, comfort, and traffic rule compliance without running a full physics simulator.

✅ Pre-trained Feature Latent Dynamics: Plans in a compact latent space derived from foundation model features, evaluating turn-timing options via Model Predictive Control without task-specific simulator training.

✅ Explicit Motion-Reasoning World Models: Predicts how pixels move (via optical flow or intent tokens) before predicting future appearance, producing physically plausible forecasts of traffic dynamics.
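
Planning in a latent space with Model Predictive Control can be sketched as follows: sample action sequences, roll them through the dynamics model, and execute the first action of the cheapest sequence. This is a toy under stated assumptions. `toy_latent_dynamics` is a hypothetical stand-in for a learned model, and random shooting is used for brevity, while the cited work may use a different optimizer.

```python
import random

def toy_latent_dynamics(z, a):
    """Hypothetical learned dynamics: the latent state moves by the action."""
    return [zi + ai for zi, ai in zip(z, a)]

def rollout_cost(z0, actions, z_goal, dynamics):
    """Squared distance between the final predicted latent and the goal latent."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan_mpc(z0, z_goal, dynamics, horizon=5, num_samples=256, seed=0):
    """Random-shooting MPC: return the first action of the best sampled sequence."""
    rng = random.Random(seed)
    dim = len(z0)
    candidates = [[[0.0] * dim for _ in range(horizon)]]  # no-op baseline
    for _ in range(num_samples):
        candidates.append([[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                           for _ in range(horizon)])
    best = min(candidates, key=lambda seq: rollout_cost(z0, seq, z_goal, dynamics))
    return best[0], rollout_cost(z0, best, z_goal, dynamics)

first_action, cost = plan_mpc([0.0, 0.0], [2.0, -1.0], toy_latent_dynamics)
```

Only the first action is executed, after which the agent re-observes and replans. That replanning is what lets the loop absorb prediction errors.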

📈 Overall Progress

World models have evolved from pixel-level generative approaches to structured latent-space methods that leverage pre-trained foundation model features. A major paradigm shift occurred with the decomposition of monolithic next-frame prediction into explicit reasoning chains (flow, intent tokens). The field has also expanded from single-domain applications to specialized variants for driving, manipulation, anomaly detection, and planetary-scale environmental monitoring.

📂 Sub-topics

Autonomous Driving World Models

4 papers

World models specifically designed for self-driving that forecast future road scenes—via occupancy grids, reward functions, or intent tokens—conditioned on ego-vehicle actions to enable safe trajectory planning.

Action-Conditioned Occupancy Forecasting Inverse RL Reward World Models Cognitive Intent World Modeling

Robotic Manipulation World Models

3 papers

World models for robot manipulation tasks that learn dynamics in latent spaces—using pre-trained visual features or motion decomposition—to enable zero-shot planning and failure detection.

Pre-trained Feature Latent Dynamics Explicit Motion-Reasoning World Models

Foundation World Model Frameworks and Surveys

3 papers

Conceptual frameworks and comprehensive surveys that define the theoretical underpinnings of world models for embodied AI, including causal reasoning requirements and VLA taxonomies.

Foundation Veridical World Models VLA Hierarchical Taxonomy

Planetary-Scale World Models

1 paper

World models that extend to Earth-scale environments using 4D space-time encodings, enabling self-supervised multi-modal learning for environmental monitoring across vast spatial and temporal ranges.

Planetary-Scale 4D Space-Time World Models

💡 Key Insights

💡 Pre-trained visual features enable zero-shot world model planning without task-specific training.

💡 Explicit motion reasoning prevents pixel-copying and improves physical plausibility of predictions.

💡 Lightweight reward world models bypass expensive simulators for closed-loop RL policy optimization.

💡 Kinematics-grounded latent spaces dramatically reduce data requirements for driving policy learning.

💡 World models double as anomaly detectors with statistical safety guarantees for deployment.

📅 Timeline

Research has progressed from conceptual frameworks and surveys (2024) through VLA-integrated cognitive architectures with explicit reasoning (2025) to domain-specialized, data-efficient models with safety guarantees for real-world deployment (2026).

2024-02 to 2024-11 Conceptual foundations and early structured world models

The causality framework (The Essential Role of Causality..., 2024) articulated why foundation models need causal reasoning for embodied AI, proposing Foundation Veridical World Models (FVWMs).
A comprehensive VLA survey (A Survey on Vision-Language-Action Models..., 2024) organized Vision-Language-Action models into a hierarchical taxonomy spanning components, control, and planning.
Drive-OccWorld (Driving in the Occupancy World, 2024) demonstrated that 4D occupancy forecasting conditioned on ego-actions improves planning safety, gaining +9.5% mIoU on nuScenes.
(DINO-WM, 2024) showed that building world models on frozen DINOv2 features enables zero-shot planning, improving success rate by 45% over IRIS.

🔀 Shift from pixel-level generative world models to structured latent-space and pre-trained-feature-based approaches that prioritize planning utility over visual fidelity.

2025-06 to 2025-10 VLA integration and cognitive world model architectures

Meta's embodied-agents work (Embodied AI Agents, 2025) proposed unifying mental and physical world models under JEPA-based architectures, releasing the 4,000-hour Seamless Interaction dataset.
(IRL-VLA, 2025) introduced Reward World Models via inverse RL, achieving 1st runner-up at the CVPR 2025 Autonomous Grand Challenge with 45.0 EPDMS on NAVSIM v2.
(FlowVLA, 2025) introduced Visual Chain of Thought that predicts optical flow before appearance, achieving state-of-the-art on CALVIN manipulation benchmarks.
(Constructing the Umwelt, 2025) replaced dense reconstruction with sparse Intent Tokens via Belief-Intent Co-Evolution for cognitively-inspired planning.

🔀 Emergence of explicit reasoning steps (optical flow, intent tokens) within world models, moving beyond monolithic next-frame prediction to decomposed prediction pipelines.

2026-01 to 2026-03 Scaling to specialized domains and robust real-world deployment

(Self-Supervised, 2026) scaled world models to planetary dimensions via 4D hash encoding, achieving 99.3% parameter reduction over Galileo while maintaining accuracy.
Foundational failure detection (Foundational World Models Accurately Detect..., 2026) applied pre-trained latent-space world models as anomaly detectors for bimanual robots with conformal prediction safety guarantees.
(Kinematics-Aware, 2026) grounded latent dynamics in explicit vehicle kinematics and spatial structure, improving mean return by 23.1% while reaching stable performance in 80k steps versus 300k+ for PPO.
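
The conformal-prediction guarantee mentioned for failure detection can be illustrated with split conformal calibration: world-model prediction errors on nominal held-out runs set a threshold, and a new run is flagged when its error exceeds that threshold. This is a generic sketch, not the cited paper's pipeline, and the error scores below are made up.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def is_anomaly(score, threshold):
    """Flag a run whose world-model prediction error exceeds the threshold."""
    return score > threshold

nominal_errors = [3, 1, 2, 7, 5, 4, 6]       # hypothetical errors on nominal runs
threshold = conformal_threshold(nominal_errors, alpha=0.25)
flagged = is_anomaly(6.5, threshold)
```

Under the usual exchangeability assumption, a nominal run is flagged with probability at most `alpha`, which is the kind of statistical guarantee the text refers to.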

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Pre-trained Feature Latent Dynamics	Use pre-trained patch-level visual features as the state space and train a transformer to predict future features conditioned on actions.	Improves on IRIS by +45% average success rate on the hardest navigation and manipulation tasks, achieving 56% better visual reconstruction fidelity (LPIPS).	DINO-WM (2024), Foundational World Models Accurately Detect... (2026)
Action-Conditioned Occupancy Forecasting	Forecast structured 3D space occupancy under different action hypotheses rather than generating raw video frames.	Drive-OccWorld improves on prior occupancy methods by +9.5% mIoU on nuScenes, achieving 38.2% mIoU; Kinematics-Aware model improves +23.1% Mean Return over image-only baselines.	Driving in the Occupancy World:... (2024), Kinematics-Aware (2026)
Inverse RL Reward World Models	Learn a differentiable Reward World Model from expert demonstrations that scores trajectories for safety and compliance without sensor simulation.	Achieves 45.0 EPDMS on NAVSIM v2, securing 1st runner-up at the CVPR 2025 Autonomous Grand Challenge and improving on prior open-loop VLA baselines.	IRL-VLA (2025)
Explicit Motion-Reasoning World Models	Insert an explicit motion-reasoning intermediate step between current observation and future state prediction to enforce physical plausibility.	FlowVLA achieves state-of-the-art on CALVIN robot manipulation benchmarks with substantially improved sample efficiency over UniVLA and WorldVLA baselines.	FlowVLA (2025), Constructing the Umwelt (2025)
Planetary-Scale 4D Space-Time World Models	Concatenate features from spatial and spatio-temporal hash grids with learned collision resolution for efficient 4D Earth-scale indexing.	Improves on standard hash encoding by +35.0% R² (0.783 vs 0.58) on Live Fuel Moisture prediction; achieves 99.3% parameter reduction (5M vs 800M) over the Galileo foundation model.	Self-Supervised (2026)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
NAVSIM v2	EPDMS (Extended Predictive Driver Model Score)	45.0 EPDMS	IRL-VLA (2025)
nuScenes Occupancy Forecasting	mIoU (mean Intersection over Union)	+9.5% mIoU over prior state-of-the-art	Driving in the Occupancy World:... (2024)
CALVIN Robot Manipulation	Task Success Rate	State-of-the-art (specific value not reported)	FlowVLA (2025)
Zero-shot Navigation and Manipulation (MiniGrid, DM Control)	Success Rate	+45% average success rate over IRIS on hardest tasks	DINO-WM (2024)
Live Fuel Moisture Content Prediction	R² (coefficient of determination)	0.783 R²	Self-Supervised (2026)
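
For readers unfamiliar with the occupancy metric in the table, mean Intersection over Union can be computed as in the sketch below. The grids and the two-class setup are made-up illustration data, and skipping classes absent from both grids is one common convention; the nuScenes evaluation protocol may differ in detail.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both grids: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 occupancy grids (0 = free, 1 = occupied).
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
miou = mean_iou(pred, target, num_classes=2)
print(round(miou, 4))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.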

⚠️ Known Limitations (4)

Long-horizon prediction degradation: world model accuracy deteriorates significantly over extended prediction horizons, making multi-second planning unreliable for safety-critical applications. (affects: Pre-trained Feature Latent Dynamics, Action-Conditioned Occupancy Forecasting, Explicit Motion-Reasoning World Models)
Potential fix: Hierarchical prediction at multiple temporal resolutions, or cognitive approaches like TIWM that reason about sparse intents rather than dense pixel-level futures.
Sim-to-real transfer gap: world models trained in simulation or on offline data may not faithfully represent real-world physics, leading to planning failures during deployment. (affects: Action-Conditioned Occupancy Forecasting, Inverse RL Reward World Models)
Potential fix: Foundation Veridical World Models with causal reasoning as proposed in the causality framework, or grounding latent spaces in explicit kinematics to enforce physical consistency.
Lack of unified evaluation: no standardized benchmark exists across driving, manipulation, and other domains, making it difficult to compare world model approaches and track overall field progress. (affects: Pre-trained Feature Latent Dynamics, Action-Conditioned Occupancy Forecasting, Explicit Motion-Reasoning World Models, Planetary-Scale 4D Space-Time World Models)
Potential fix: Establishing cross-domain benchmark suites that test both prediction fidelity and downstream planning performance, as advocated by VLA surveys.
Computational overhead: training and running world models adds significant cost on top of base policies, particularly for methods requiring high-resolution 3D occupancy prediction or multi-modal fusion. (affects: Action-Conditioned Occupancy Forecasting, Planetary-Scale 4D Space-Time World Models)
Potential fix: Parameter-efficient approaches like DeepEarth's 4D hash encoding (99.3% parameter reduction) or compact latent-space methods like DINO-WM and the Cosmos-based failure detector (1/20th parameters).

📚 View major papers in this topic (11)

IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model for End-to-End Autonomous Driving (2025-08) 8
FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models (2025-08) 8
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (2024-11) 8
Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding (2026-03) 8
Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving (2024-08) 7
The Essential Role of Causality in Foundation World Models for Embodied AI (2024-02) 7
A Survey on Vision-Language-Action Models for Embodied AI (2024-05) 7
Embodied AI Agents: Modeling the World (2025-06) 7
Constructing the Umwelt: Cognitive Planning through Belief-Intent Co-Evolution (2025-10) 7
Foundational World Models Accurately Detect Bimanual Manipulator Failures (2026-03) 7
Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving (2026-03) 7

💡 Moving to the next paradigm, we turn to Multimodal Generation.

🕸️

Multimodal Generation

What: Research on generating content across multiple modalities (images, 3D, video, speech) using unified generative frameworks including diffusion models, flow matching, and reinforcement-learning-enhanced optimization.

Why: Enabling machines to create high-quality, controllable content across modalities is essential for creative tools, robotics, scientific discovery, and human-AI interaction.

Baseline: Standard generative models (GANs, vanilla diffusion) produce content in single modalities with limited controllability and no cross-modal consistency guarantees.

Sparse reward signals in RL-based generation fail to credit individual denoising steps appropriately
Maintaining geometric consistency and fine-grained details across 3D views and multimodal outputs
Balancing identity fidelity with editability and safety in personalized generation tasks

🧪 Running Example

❓ Generate a 3D model of a golden dragon statue with reflective metallic scales, then produce multi-view renderings with consistent specular highlights.

Baseline: A standard diffusion model generates plausible 2D images but produces inconsistent geometry across views, lacks realistic view-dependent reflections, and cannot incorporate human preference feedback to iteratively improve quality.

Challenge: This example requires 3D geometric consistency (multi-view coherence), view-dependent appearance modeling (metallic reflections), and fine-grained reward decomposition to identify which denoising steps contribute to texture quality versus geometric accuracy.

✅ Hierarchical GRPO for 3D Generation: Decomposes generation into global shape planning and local texture refinement stages, using tailored reward models at each stage to ensure both geometric accuracy and surface detail quality.

✅ Surface Light Field Tokenization (LiTo): Encodes view-dependent radiance into latent vectors with higher-order spherical harmonics, enabling consistent specular highlights that move naturally with camera viewpoint.

✅ TurningPoint-GRPO: Assigns incremental rewards to each denoising step based on its measured contribution, so steps that establish correct geometry are credited differently from those refining texture.
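
The incremental-reward idea can be illustrated with a toy calculation. The sketch assumes a hypothetical per-step quality score (for example, a reward model applied to the intermediate sample) and credits each denoising step with the change in that score. It does not reproduce the method's actual reward definition or its turning-point detection.

```python
def incremental_rewards(step_scores, baseline=0.0):
    """Credit each step with the change in score it produced."""
    rewards = []
    prev = baseline
    for score in step_scores:
        rewards.append(score - prev)
        prev = score
    return rewards

# Hypothetical quality scores after each of four denoising steps.
step_scores = [0.10, 0.45, 0.50, 0.80]
step_rewards = incremental_rewards(step_scores)
print(step_rewards)
```

The per-step rewards telescope: their sum equals the final score minus the baseline, so the terminal reward is redistributed across steps rather than changed in total.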

📈 Overall Progress

The field has evolved from isolated single-modality generation toward unified frameworks that handle multiple tasks (generation, optimization, planning) within a single model. The most significant paradigm shift has been the adoption of reinforcement learning — particularly GRPO variants — as a universal fine-tuning strategy across 2D, 3D, and embodied generation domains, with increasingly sophisticated reward decomposition. Simultaneously, theoretical work has matured, providing rigorous mathematical foundations (Wasserstein gradient flows, topological analysis) for emerging generative approaches.

📂 Sub-topics

RL-Optimized Multimodal Generation

5 papers

Applying reinforcement learning — particularly Group Relative Policy Optimization (GRPO) variants — to improve generative model outputs across image, 3D, and robotic domains by optimizing reward signals during the denoising process.

TurningPoint-GRPO HCM-GRPO Syn-GRPO Hi-GRPO

Diffusion-Based 3D & Scene Generation

4 papers

Using diffusion models and flow matching to generate 3D scenes, motion plans, and structural ensembles with physics-based constraints and multi-view consistency.

SceneDiffuser GUMP DiffBacChrom LiTo

Multimodal Understanding, Reasoning & Embeddings

4 papers

Methods for jointly reasoning across modalities, producing unified embeddings, modeling inter/intra-modality dependencies, and auditing black-box vision systems through semantic approaches.

Think-Then-Embed I2M2 UNBOX Image Hijacks

Personalized & Empathetic Generation

2 papers

Generating identity-preserving portraits and emotionally responsive multimodal content (text, voice, avatar) that maintains consistency across attributes and modalities.

UniPortrait E3RG

Generative AI Foundations & Surveys

4 papers

Theoretical frameworks for generative modeling (gradient flows, topological analysis) and comprehensive surveys mapping the AIGC landscape from GANs to multimodal LLMs.

Gradient Flow Drifting Cross-Persistence Density Unified AIGC Taxonomy

Domain-Specific & Applied Generation

6 papers

Application of multimodal generative models to specialized domains including molecular design, nanophotonic fabrication, medical imaging, and federated learning for tactile internet.

PepFlow Gen-Fab DRMF

Human-AI Co-Creation & Interaction

5 papers

Studies on how humans collaborate with generative AI tools for creative design, exploring prompt strategies, trust dynamics, context-aware generation workflows, and frameworks for collaborative ideation.

CoT Prompt Coaching ContextCam

💡 Key Insights

💡 RL fine-tuning via GRPO variants improves generation quality across 2D, 3D, and robotics

💡 Step-wise reward decomposition significantly outperforms sparse terminal rewards for denoising

💡 Small RL-tuned models (2B parameters) can surpass large proprietary models on specialized tasks

💡 Unified diffusion handles generation, optimization, and planning within one framework

💡 Reasoning before embedding boosts multimodal representation quality by over 10%

📅 Timeline

Research has progressed from foundational diffusion architectures and surveys (2023) through specialized multi-modal and personalized generation (2024) to RL-optimized generation with principled step-wise reward design (2025–2026), with a clear trend toward smaller models achieving parity with large proprietary systems through targeted RL training.

2023-01 to 2023-12 Foundational generative frameworks, adversarial studies, and landscape surveys

SceneDiffuser (Diffusion-based Generation, Optimization, and Planning..., 2023) introduced unified diffusion for joint 3D scene generation, physics optimization, and planning — achieving 49.35% physical plausibility vs 14.64% for cVAE baselines
(Image Hijacks, 2023) revealed critical vulnerabilities in VLMs through adversarial image optimization, achieving a 100% attack success rate
The AIGC Survey (A Comprehensive Survey of AI-Generated Content, 2023) mapped the generative AI landscape from GANs to ChatGPT, identifying the Transformer as the convergence point for vision and language

2024-01 to 2024-12 Scaling multimodal generation with specialized architectures and personalization

PepFlow (Full-Atom, 2024) pioneered multi-modal Riemannian flow matching across four geometric manifolds (R^3, SO(3), Hypertorus, Simplex) for molecular design
GUMP (Solving Motion Planning Tasks with..., 2024) demonstrated a single generative world model serving simultaneously as simulator, planner, and RL environment for autonomous driving
(UniPortrait, 2024) solved multi-identity image personalization with plug-and-play ID embedding decoupling and spatial routing, outperforming InstantID and FastComposer
I2M2 (Jointly Modeling Inter- & Intra-Modality Dependencies, 2024) introduced a Product of Experts approach to dynamically leverage inter- and intra-modality dependencies for multi-modal learning
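
A generic Product of Experts combination, as mentioned for I2M2, multiplies the class probabilities from each expert and renormalizes. The sketch below shows only that generic rule; the three experts and their probabilities are hypothetical, and the paper's actual model and training procedure are not reproduced.

```python
import numpy as np

def product_of_experts(expert_probs, eps=1e-12):
    """Multiply per-expert class probabilities (in log space) and renormalize."""
    log_p = np.log(np.clip(np.asarray(expert_probs, dtype=float), eps, 1.0))
    combined = log_p.sum(axis=0)
    combined -= combined.max()  # subtract the max for numerical stability
    p = np.exp(combined)
    return p / p.sum()

# Hypothetical two-class predictions from three experts.
expert_probs = [
    [0.5, 0.5],  # joint (inter-modality) expert, uninformative here
    [0.6, 0.4],  # image-only (intra-modality) expert
    [0.7, 0.3],  # text-only (intra-modality) expert
]
fused = product_of_experts(expert_probs)
print(fused.round(4))
```

An uninformative expert (uniform probabilities) leaves the product unchanged after normalization, which is how the combination can lean on whichever dependency is informative for a given input.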

2025-01 to 2026-03 RL-driven generation revolution and theoretical consolidation

Hi-GRPO (Are We Ready for RL..., 2025) conducted the first systematic study of RL for 3D generation with hierarchical reward decomposition, achieving 28.5 CLIP Score on MME-3DR
Syn-GRPO (Self-Evolving, 2025) addressed entropy collapse via asynchronous on-the-fly data synthesis with diversity rewards, improving accuracy on RefCOCOg by 3.4% over Visual-RFT
TP-GRPO (Alleviating Sparse Rewards in Flow-Based GRPO, 2026) replaced sparse terminal rewards with incremental per-step credit assignment and turning-point detection for flow-based generation
(Think-Then-Embed, 2025) bridged generative reasoning and embedding quality by introducing intermediate reasoning traces, achieving a state-of-the-art 71.5% on MMEB-V2
(Gradient Flow Drifting, 2026) unified the theoretical foundations of drifting generative models through Wasserstein gradient flow equivalence, enabling principled divergence mixing
LiTo (Surface Light Field Tokenization, 2026) introduced the first latent 3D representation jointly modeling geometry and view-dependent appearance with spherical harmonics
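
To make "view-dependent appearance with spherical harmonics" concrete, the sketch below evaluates real spherical harmonics up to degree 1 for a viewing direction and combines them with per-basis RGB coefficients. It is a simplified illustration: LiTo is described as using higher-order harmonics inside a learned latent, the coefficients here are invented, and the sign and ordering convention of the degree-1 terms is an assumption (conventions vary between codebases).

```python
import numpy as np

SH_C0 = 0.28209479177387814  # constant (degree-0) basis value
SH_C1 = 0.4886025119029199   # magnitude of the degree-1 basis functions

def sh_basis_deg1(view_dir):
    """Real SH basis up to degree 1 for a (normalized) view direction."""
    d = np.asarray(view_dir, dtype=float)
    x, y, z = d / np.linalg.norm(d)
    return np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])

def radiance(coeffs, view_dir):
    """coeffs has shape (4, 3): one RGB triple per basis function."""
    return sh_basis_deg1(view_dir) @ np.asarray(coeffs, dtype=float)

# Hypothetical surface point: a gold-like base color plus a lobe along +z.
coeffs = [
    [1.0, 0.8, 0.2],  # degree 0: view-independent color
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],  # degree 1, z term: brighter when viewed from +z
    [0.0, 0.0, 0.0],
]
front = radiance(coeffs, [0.0, 0.0, 1.0])
side = radiance(coeffs, [1.0, 0.0, 0.0])
print(front.round(3), side.round(3))
```

The same surface point returns a brighter color from the front than from the side, which is the behavior a specular highlight needs and a single view-independent color cannot express.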

🔀 Reinforcement learning — especially GRPO variants — became the dominant paradigm for improving generative outputs across 2D images, 3D assets, and robotic policies, replacing purely supervised or GAN-based optimization with reward-driven trajectory comparison.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Group Relative Policy Optimization for Generation	Rank groups of generation trajectories by reward signals and optimize the policy to favor higher-ranked outputs, with variants addressing step-wise credit assignment, hard-case mining, and data diversity.	Hi-GRPO improves on base ShapeLLM-Omni by +8.7 CLIP Score on MME-3DR, achieving 28.5 vs 19.8; Syn-GRPO improves on Visual-RFT by +3.4% accuracy on RefCOCOg; HCM-GRPO with a 2B model surpasses GPT-4o by +20 points on aesthetic reasoning.	Alleviating Sparse Rewards by Modeling... (2026), Image Aesthetic Reasoning via HCM-GRPO:... (2025), Syn-GRPO (2025), Are We Ready for RL... (2025), Reinforcement Learning for Flow-Matching Policies (2025)
Unified Diffusion for 3D Scene Understanding	Inject physics constraints (collision, contact) and goals (target location) as differentiable gradients during each denoising step, replacing separate planners and optimizers with one sampling loop.	SceneDiffuser achieves 49.35% physically plausible human poses vs 14.64% for cVAE baselines (+34.7 pp); attains 71.27% grasp success where cVAE+optimization fails completely (0.00%).	Diffusion-based Generation, Optimization, and Planning... (2023), Solving Motion Planning Tasks with... (2024)
Identity-Preserving Personalized Generation	Decouple identity into intrinsic features and spatial structure branches, with dynamic ID routing that assigns the best-matching identity to each spatial location during generation.	UniPortrait achieves higher identity similarity (CS-I) and prompt consistency (CLIP-T) than InstantID and IP-Adapter-FaceID-Plus on single-ID benchmarks; outperforms FastComposer on multi-ID customization.	UniPortrait (2024), E3RG (2025)
Think-Then-Embed Multimodal Reasoning	Generate an Embedding-Centric Reasoning (ECR) trace before creating the embedding, conditioning the final representation on both the original input and the intermediate reasoning.	TTEt-7B achieves 71.5% on MMEB-V2, surpassing proprietary models like seed-1.6-embedding; TTEs-7B outperforms VLM2Vec-V2 by +7.4% on MMEB-V1, achieving state-of-the-art; TTEt-2B improves over VLM2Vec-V2 2B by +10.6% on MMEB-V2.	Think-Then-Embed (2025)
Gradient Flow Drifting Framework	The drifting field in generative drifting models equals the Wasserstein-2 gradient flow velocity for KDE-smoothed KL divergence, generalizable to any f-divergence or MMD.	Provides the first rigorous theoretical foundation for Drifting Models, which previously relied on heuristic analysis; generalizes to arbitrary f-divergences (Reverse KL, Chi-squared) and principled mixing of mode-seeking and mode-covering flows.	Gradient Flow Drifting (2026)
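
The group-relative step in the first row can be sketched in a few lines. This shows the commonly used normalization (subtract the group mean, divide by the group standard deviation). The reward values are hypothetical, and the variants listed above add further machinery that is not shown.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against the other samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical reward-model scores for four generations of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages.round(3))
```

Samples above the group mean receive positive advantages and are reinforced; if every sample in a group scores the same, all advantages are zero and that group contributes no learning signal.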

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MME-3DR (3D Generation Quality)	CLIP Score	28.5 CLIP Score	Are We Ready for RL... (2025)
MMEB-V2 (Massive Multimodal Embedding Benchmark)	Average Score (%)	71.5%	Think-Then-Embed (2025)
3D Scene Physical Plausibility	Physical Plausibility Rate (%)	49.35%	Diffusion-based Generation, Optimization, and Planning... (2023)
RefCOCOg (Referring Expression Comprehension)	Accuracy (%)	+3.4% over Visual-RFT baseline	Syn-GRPO (2025)
Image Aesthetic Reasoning Benchmark	Accuracy Score	64.74	Image Aesthetic Reasoning via HCM-GRPO:... (2025)
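
A CLIP-style score such as the one reported for MME-3DR is typically a scaled cosine similarity between an image embedding and a text embedding. The sketch below uses one common convention (100 times the cosine, clipped at zero) with made-up 2-D vectors in place of real encoder outputs; the benchmark's exact scoring protocol is not stated in this summary.

```python
import numpy as np

def clip_style_score(image_emb, text_emb):
    """100 * cosine similarity between two embeddings, floored at zero."""
    a = np.asarray(image_emb, dtype=float)
    b = np.asarray(text_emb, dtype=float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 100.0 * max(cos, 0.0)

# Hypothetical embeddings; real ones would come from a CLIP image/text encoder.
score = clip_style_score([1.0, 1.0], [1.0, 0.0])
print(round(score, 2))
```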

⚠️ Known Limitations (4)

Reward design sensitivity: RL-based generation methods are highly sensitive to reward model choice and design — poor rewards lead to mode collapse or reward hacking rather than genuine quality improvement, especially for 3D tasks with higher spatial complexity. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Unified Diffusion for 3D Scene Understanding)
Potential fix: Hierarchical reward decomposition (Hi-GRPO) and incremental per-step rewards (TP-GRPO) partially address this by providing more granular, less noisy feedback signals; ensemble reward models combining human preference, aesthetic, and LMM-based evaluators further improve robustness.
Computational cost and scalability: RL fine-tuning requires generating multiple complete trajectories per optimization step, multiplying training cost; diffusion-based 3D methods require many iterative denoising steps per sample, limiting real-time applicability. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Unified Diffusion for 3D Scene Understanding)
Potential fix: Partial-autoregressive decoding (GUMP) and variable-horizon generation reduce inference cost by 50-85%; asynchronous data synthesis (Syn-GRPO) improves training efficiency by generating diverse samples on-the-fly.
Evaluation gaps: Current benchmarks typically measure single aspects (e.g., CLIP alignment, FID) while ignoring perceptual quality, physical plausibility, or user preference holistically, making comprehensive quality assessment of multimodal generation difficult. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Identity-Preserving Personalized Generation)
Potential fix: HCM-GRPO proposes dedicated aesthetic reasoning benchmarks with 128k samples; combining multiple reward models (human preference, aesthetic, LMM-based) provides more holistic multi-dimensional evaluation.
Adversarial vulnerability: Generative models with continuous image input channels are susceptible to adversarial manipulation, where optimized images can override intended model behavior with near-perfect success rates, and existing text-based safety training offers little defense. (affects: Identity-Preserving Personalized Generation, Think-Then-Embed Multimodal Reasoning)
Potential fix: Current text-based safety training is ineffective against image-channel attacks; robust adversarial training against image perturbations and input validation pipelines are needed but remain an open research problem.

📚 View major papers in this topic (10)

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation (2025-12) 8
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning (2025-11) 8
Diffusion-based Generation, Optimization, and Planning in 3D Scenes (2023-01) 8
Solving Motion Planning Tasks with a Scalable Generative Model (2024-07) 8
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization (2024-08) 8
Think-Then-Embed: Transforming MLLMs into Personalized Multimodal Embedding Models (2025-12) 8
Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences (2026-03) 8
Image Hijacks: Adversarial Images can Control Generative Models at Runtime (2023-09) 8
LiTo: Surface Light Field Tokenization (2026-03) 8
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT (2023-12) 8

💡 Diving deeper into Multimodal Generation, let's examine specific research threads that define this area.

📐

Text-to-Image Generation

What: Text-to-Image Generation synthesizes high-fidelity images from natural language descriptions using diffusion models, autoregressive transformers, and flow-matching architectures.

Why: Enabling anyone to create photorealistic or artistic images from text unlocks creative workflows across design, entertainment, education, and scientific visualization.

Baseline: Standard text-to-image diffusion models generate images by iteratively denoising random noise conditioned on CLIP-encoded text embeddings via a U-Net backbone.

Aligning generated images with complex compositional prompts involving multiple objects, attributes, and spatial relationships
Preserving specific subject identity while maintaining text editability and generation diversity
Achieving high-quality generation efficiently with reduced inference steps and computational cost

🧪 Running Example

❓ Generate an image of 'my specific golden retriever wearing red sunglasses, skateboarding down a neon-lit Tokyo street at sunset'

Baseline: A standard diffusion model generates a generic golden retriever (not the user's specific dog), may omit or misplace the sunglasses, and may struggle with the spatial relationship between the dog, skateboard, and street scene.

Challenge: This prompt requires compositional reasoning (multiple objects with specific attributes), identity preservation (user's specific dog), and spatial understanding (dog on skateboard in a street scene). It also requires generating the image efficiently for interactive use.

✅ GRPO-Based Visual Alignment: Flow-GRPO or DanceGRPO fine-tunes the model using reward signals from human preference models, improving compositional accuracy so the sunglasses and skateboard appear correctly bound to the dog.

✅ Chain-of-Thought Reasoning for Generation: T2I-R1 or GoT first generates a textual reasoning plan ('place dog on skateboard at center, add sunglasses, neon lights on sides') with spatial coordinates before pixel generation, ensuring correct layout.

✅ Training-Free Identity Personalization: InstantID or Personalize Anything injects the user's specific dog identity from a reference photo via parallel attention branches, preserving the exact appearance without model fine-tuning.

✅ Direct Reward Fine-Tuning: DRaFT or Diff-Instruct* backpropagates aesthetic and alignment reward gradients directly through the denoising chain, boosting visual quality and prompt adherence in one or a few steps.

📈 Overall Progress

Text-to-image generation has progressed from basic supervised diffusion models to a mature ecosystem encompassing RL-aligned generation, explicit reasoning pipelines, and efficient one-step synthesis. The field witnessed three paradigm shifts: from supervised to RL-based alignment (2023), from generic to identity-preserving personalization (2024), and from direct text-to-image mapping to reasoning-guided generation (2025–2026). The GRPO framework emerged as the dominant alignment paradigm, spawning over 30 specialized variants in 2025 alone.

📂 Sub-topics

RL-Based Preference Alignment

55 papers

Methods that apply reinforcement learning—particularly Group Relative Policy Optimization (GRPO) and its variants—to align diffusion and flow-matching models with human preferences using reward signals.

Flow-GRPO DanceGRPO BranchGRPO DiffusionNFT

Reward Modeling & Direct Preference Optimization

35 papers

Reward models for evaluating generated images and DPO-based methods that bypass explicit reward functions by learning directly from preference pairs to fine-tune diffusion models.

D3PO SPO Diffusion-DPO LLaVA-Reward

Reasoning-Enhanced Generation

20 papers

Methods that inject explicit chain-of-thought reasoning, visual planning, or code-based planning before or during the image generation process to improve compositional accuracy.

GoT T2I-R1 ReasonGen-R1 ImageGen-CoT

Personalized & Subject-Driven Generation

60 papers

Techniques for customizing text-to-image models to generate images of specific user-provided subjects (faces, objects, styles) while maintaining text-based editability and multi-subject composition.

InstantID DreamBooth JeDi Personalize Anything

Efficient Inference & Model Compression

25 papers

Post-training quantization, one-step distillation, token merging, and architectural efficiency methods that enable fast deployment of large-scale diffusion and transformer-based generators.

PTQ4DiT Adversarial Post-Training Diff-Instruct* PTQD

Unified Multi-Modal Architectures

25 papers

Models that unify text understanding, image generation, and other modalities (audio, video) within a single architecture, enabling interleaved generation and any-to-any transformation.

CM3Leon UniDiffuser Mixture-of-Transformers MMaDA

Safety, Robustness & Evaluation

18 papers

Concept erasure, adversarial robustness, watermarking, and evaluation benchmarks that ensure generated content is safe, attributable, and faithfully measured.

OrthoEraser GHOST CIGEval mAVE

💡 Key Insights

💡 GRPO-based RL has become the dominant alignment paradigm with 30+ variants in 2025 alone

💡 Chain-of-thought reasoning before generation boosts compositional accuracy by 13–68%

💡 One-step models now outperform multi-step giants when combined with score-based alignment

💡 Identity personalization shifted from minutes of fine-tuning to seconds of zero-shot encoding

💡 Early denoising steps determine semantic diversity while late steps control fine-grained detail

📅 Timeline

Research has converged on three frontiers: (1) increasingly sophisticated RL alignment methods that exploit temporal structure in diffusion, (2) chain-of-thought reasoning that decomposes complex prompts before generation, and (3) unified architectures that merge understanding and generation in a single model.

2023-02 to 2023-12 Foundation: Diffusion fine-tuning, early RL alignment, personalization primitives, and first quantization methods

(DPOK, 2023) established online RL for diffusion with KL-regularized policy gradients
D3PO (Using Human Feedback to Fine-tune..., 2023) applied DPO directly to diffusion's multi-step MDP, eliminating separate reward models
DRaFT (Directly Fine-Tuning Diffusion Models on..., 2023) pioneered backpropagation through the full sampling chain, >200x faster than RL
CM3Leon (Scaling Autoregressive Multi-Modal Models, 2023) achieved zero-shot FID 4.88, proving autoregressive models can rival diffusion with 5x less compute
PTQD (Accurate Post-Training Quantization for Diffusion Models, 2023) introduced quantization noise correction that absorbs error into diffusion variance
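
The preference-based objective that D3PO builds on can be written compactly. The sketch below is the standard DPO loss for a single preference pair; the scalar log-probabilities are hypothetical stand-ins, whereas a diffusion variant would obtain them from the denoising steps, which is not shown here.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """-log(sigmoid(margin)) for one (preferred, rejected) pair."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has shifted probability toward the preferred sample: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(round(neutral, 4), round(improved, 4))
```

No reward model appears in the loss: the preference pair and the frozen reference policy are the only supervision, which is the sense in which a separate reward model is eliminated.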

🔀 The shift from supervised fine-tuning to reinforcement learning for diffusion model alignment, with DPOK and D3PO establishing the multi-step MDP framework.

2024-01 to 2024-12 Scaling up: Large-scale RL alignment, unified multi-modal models, DiT quantization, and zero-shot personalization

(InstantID, 2024) enabled plug-and-play face personalization using face recognition embeddings
Diff-Instruct* (David and Goliath, 2024) showed a 2.6B one-step model can outperform 12B FLUX-dev using score-based divergence RLHF
PTQ4DiT (Post-training Quantization for Diffusion Transformers, 2024) achieved the first effective 4-bit weight quantization for DiT architectures
Large-Scale RL (Large-scale Reinforcement Learning for Diffusion Models, 2024) scaled RL to millions of prompts with distribution-based fairness rewards
(MoT, 2024) matched dense baseline performance using only 55.8% of training FLOPs via modality-specific parameter routing

2025-01 to 2026-03 GRPO revolution, chain-of-thought reasoning, and unified generation-understanding models

(Flow-GRPO, 2025) introduced ODE-to-SDE conversion enabling online RL for flow models, boosting GenEval from 63% to 95%
(DanceGRPO, 2025) unified GRPO for diffusion and rectified flow, scaling stably to 10,000+ prompts
T2I-R1 (T2I-R1, 2025) introduced bi-level chain-of-thought (semantic + token level) for reasoning-enhanced generation
(APT, 2025) enabled one-step 1280×720 video generation by training against real data rather than distilling from a teacher
Seedream 4.0 (Seedream 4.0, 2025) unified T2I, editing, and multi-image composition, ranking #1 on Artificial Analysis Arena
(EndoCoT, 2026) scaled chain-of-thought to diffusion transformers, achieving 92.1% on complex reasoning benchmarks

🔀 The explosion of GRPO variants transformed visual generation alignment from unstable RL into a principled, scalable framework, while chain-of-thought reasoning bridged the gap between language understanding and visual synthesis.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Group Relative Policy Optimization for Visual Generation	Generate multiple images per prompt, compute relative rewards within the group, and update the policy to favor high-reward trajectories over low-reward ones.	Improves on DDPO/DPOK by scaling to 10,000+ prompts stably. Flow-GRPO boosts SD3.5-M GenEval from 63% to 95%, and DanceGRPO achieves +181% on VideoAlign motion quality.	Flow-GRPO (2025), DanceGRPO (2025), DiffusionNFT (2025), BranchGRPO (2025), TempFlow-GRPO (2025)
Direct Reward Fine-Tuning & Preference Optimization	Treat the sampling chain as a differentiable computation graph and propagate reward signals directly to model parameters, or use preference pairs to implicitly learn optimal rewards.	Improves on standard RL (DDPO) by >200x faster convergence. Diff-Instruct* (2.6B, 1-step) outperforms FLUX-dev (12B, 50-step) on ImageReward and PickScore.	Diff-Instruct*: Small One-step Model Beats... (2024), DRaFT (2023), D3PO (2023), TDM-R1 (2026)
Chain-of-Thought Reasoning for Image Generation	Decompose image generation into a reasoning phase (producing plans, layouts, or code scaffolds) and a synthesis phase, optimized jointly via reinforcement learning.	Improves on direct text-to-image generation by +13% on T2I-CompBench (T2I-R1 vs. Janus-Pro) and +68.83% on StructT2IBench (CoCo vs. Bagel baseline).	T2I-R1 (2025), GoT (2025), CoCo (2026), EndoCoT (2026)
Training-Free Identity-Preserving Personalization	Decouple identity encoding from text conditioning using specialized face or subject encoders injected via parallel attention, ControlNet-like branches, or token replacement.	Improves on DreamBooth by eliminating fine-tuning (100x speedup from minutes to seconds) while achieving competitive or superior identity fidelity. InstantID matches LoRA methods using a single reference image.	InstantID (2024), Personalize Anything for Free with... (2025), JeDi (2024), InfiniteYou (2025)
Post-Training Quantization & One-Step Distillation	Adapt quantization parameters to the temporal dynamics of diffusion (progressive calibration, distribution-aware grouping) or replace iterative denoising with one-step adversarial generation.	PTQ4DiT achieves near-lossless W8A8 on DiT-XL where baselines degrade to 58.74 FID. Adversarial Post-Training enables one-step 1280×720 video at 24fps on a single H100.	PTQ4DiT (2024), Diffusion Adversarial Post-Training for One-Step... (2025), PTQD (2023), PCR (2023)
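
As background for the quantization row, the sketch below shows plain symmetric per-tensor 8-bit quantization, the kind of baseline that diffusion-specific methods refine. It is not PTQ4DiT or PTQD: the timestep-aware calibration those methods add is omitted, and the weight values are made up.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.max(np.abs(x)) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate real values."""
    return q.astype(np.float64) * scale

# Hypothetical weight values.
weights = np.array([-1.0, -0.37, 0.0, 0.25, 0.81])
q, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(weights - dequantize(q, scale))))
print(q, round(max_error, 5))
```

The round-trip error is bounded by half a quantization step, so a single outlier that inflates `scale` degrades every other value in the tensor.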

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
GenEval	Overall Accuracy (%)	98% (GenEval score 0.98)	DiffusionNFT (2025)
T2I-CompBench	Average Compositional Score	+13% over Janus-Pro baseline	T2I-R1 (2025)
HPSv2.1 (Human Preference Score v2.1)	HPSv2.1 Score	31.19 HPSv2.1	David and Goliath (2024)
PickScore	Win Rate / PickScore	67.48% win-rate on Pick-a-Pic v1 test set	LPO (2025)

⚠️ Known Limitations (4)

Reward hacking and mode collapse: Models optimized with RL or DPO tend to exploit imperfect reward models, generating high-scoring but low-diversity or unrealistic images that overfit to the proxy objective. (affects: GRPO-Based Visual Alignment, Direct Reward Fine-Tuning & Preference Optimization)
Potential fix: Pairwise preference rewards (Pref-GRPO), distribution-aware reward bonuses for rare clusters (DiverseGRPO), annealed importance guidance (AIG), and self-entropy regularization (SEE-DPO) all address this by explicitly encouraging diversity.
Identity-editability tradeoff in personalization: Methods that strongly preserve subject identity often lose the ability to follow complex text prompts, while editable methods sacrifice identity fidelity. (affects: Training-Free Identity-Preserving Personalization)
Potential fix: Decoupling identity from text via separate attention branches (Infinite-ID), parallel attention architectures (Imagine yourself), and synthetic data curricula that balance identity and editability during training.
Sparse credit assignment across denoising timesteps: Most RL methods assign a single terminal reward uniformly to all timesteps, ignoring that early steps determine structure while late steps refine details. (affects: GRPO-Based Visual Alignment)
Potential fix: Trajectory branching at specific timesteps (TempFlow-GRPO, BranchGRPO, TreeGRPO), tree-structured rollouts with depth-wise advantage estimation, and chunk-level optimization that groups timesteps by temporal dynamics.
Computational cost of RL training for visual models: Full trajectory sampling, large group sizes, and multi-step backpropagation make RL fine-tuning prohibitively expensive, limiting accessibility. (affects: GRPO-Based Visual Alignment, Direct Reward Fine-Tuning & Preference Optimization)
Potential fix: Prefix reuse in tree-structured rollouts (TreeGRPO, 2.4x speedup), early trajectory pruning via ODE preview (Pro-GRPO), and deterministic ODE-based training that avoids SDE overhead (Neighbor GRPO, 12x fewer forward-backward calculations).

📚 View major papers in this topic (10)

Flow-GRPO: Training Flow Matching Models via Online RL (2025-05) 9
DanceGRPO: Unleashing GRPO on Visual Generation (2025-05) 9
DiffusionNFT: Online Diffusion Reinforcement with Forward Process (2025-09) 9
Diff-Instruct*: Small One-step Model Beats Large Diffusion with Score Post-training (2024-10) 9
Diffusion Adversarial Post-Training for One-Step Video Generation (2025-01) 9
Seedream 4.0: Toward Next-generation Multimodal Image Generation (2025-09) 9
InstantID: Zero-shot Identity-Preserving Generation in Seconds (2024-01) 9
CM3Leon: Scaling Autoregressive Multi-Modal Models (2023-09) 9
RewardDance: Reward Scaling in Visual Generation (2025-09) 9
TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward (2026-03) 9

💡 Within the same paradigm, another important research direction focuses on Text-to-Video Generation.

🎯

Text-to-Video Generation

What: Research on generating temporally coherent, high-fidelity video sequences from textual descriptions using diffusion models, autoregressive transformers, and flow matching architectures.

Why: Enabling automated video creation democratizes content production for entertainment, education, robotics simulation, and embodied AI planning.

Baseline: Standard text-to-video diffusion models iteratively denoise latent representations conditioned on text embeddings, requiring 50+ sampling steps with limited motion control.

Maintaining temporal coherence and motion consistency across frames while scaling to longer durations
Aligning generated videos with human aesthetic preferences and physical plausibility beyond training data
Reducing computational cost of iterative diffusion sampling for real-time or interactive applications

🧪 Running Example

❓ Generate a 10-second video of a golden retriever running through a sunflower field, stopping to sniff a flower, then looking up at the camera

Baseline: A standard 50-step diffusion model produces a blurry dog-like shape drifting across static flowers with flickering artifacts, inconsistent dog appearance between frames, and the 'sniffing' action entirely ignored — the dog teleports from running to standing

Challenge: This example requires temporal coherence (consistent dog appearance), action understanding (running → stopping → sniffing → looking up), physical plausibility (natural deceleration), and must be generated fast enough for iterative creative workflows

✅ Reward-Aligned RL Post-Training: DanceGRPO fine-tunes the model using reward signals from motion quality and text-alignment models, teaching it to properly sequence the run-stop-sniff-lookup actions with smooth transitions

✅ Adversarial Post-Training & Fast Distillation: APT compresses the 50-step generation into a single forward pass while maintaining visual fidelity, enabling real-time previewing and iterative creative refinement

✅ Test-Time Compute Scaling for Video: EvoSearch explores multiple generation paths at inference time using evolutionary algorithms, selecting the video with the best motion quality and prompt adherence from an evolving population

✅ Video World Models for Embodied Intelligence: Drive-WM's physics-aware generation would ensure the dog decelerates naturally before stopping, maintaining physical plausibility in the motion dynamics

📈 Overall Progress

Text-to-video generation has evolved from basic text-conditioned diffusion requiring 50+ slow sampling steps to real-time, one-step generation with human-preference alignment. The field underwent two major paradigm shifts: first from supervised training to reward-based RL post-training (2024), then from training-time-only improvement to test-time compute scaling (2025). Simultaneously, the scope expanded dramatically — from short single-clip generation to multi-scene narrative films and physically-grounded world simulation for robotics and autonomous driving.

📂 Sub-topics

Reward-Based Post-Training & Alignment

13 papers

Methods that use reinforcement learning, reward models, and human preference optimization to align video diffusion models with quality, motion, and text-adherence objectives after initial pre-training. This is the largest and most active sub-topic.

DanceGRPO VADER RewardDance TAGRPO

Efficient & Few-Step Video Generation

6 papers

Distillation and adversarial training techniques that reduce the number of sampling steps from 50+ to 1–4 steps, enabling real-time or near-real-time video generation without prohibitive quality loss.

APT AAPT DOLLAR T2V-Turbo

Long-Form & Multi-Scene Narrative Generation

7 papers

Approaches for generating coherent multi-shot, multi-scene videos with consistent characters and narrative structure, extending video generation beyond single-clip outputs to minutes-long storytelling.

LCT MovieAgent InfLVG COMIC

Video World Models & Embodied AI

8 papers

Using video generation as physics-aware world simulators for robotics planning, autonomous driving, and embodied agents, where generated videos must be physically consistent and action-conditioned.

WMPO RLIR RLWG Drive-WM

Video Personalization & Identity Preservation

5 papers

Techniques for generating videos featuring specific identities, styles, or dynamic concepts from reference images or short clips, without expensive per-subject test-time optimization.

AnimateDiff Video Alchemist Movie Weaver Set-and-Sequence

Audio-Visual Joint Generation

3 papers

Unified frameworks that generate synchronized audio and video simultaneously, including speech with lip-sync, sound effects, and music aligned to visual content.

Seedance 1.5 Pro MM-LDM MM-Sonate

Human Motion & Avatar Animation

4 papers

Generating realistic human body motion, co-speech gestures, sign language, and unified multi-task avatar animation from text, audio, or multimodal inputs.

LMM EchoMimicV3 FreeTalker MaDiS

Test-Time Compute Scaling

2 papers

Methods that allocate additional computation at inference time through search, evolutionary algorithms, or tree-based exploration to improve video quality without retraining.

EvoSearch Video-T1
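
The search-at-inference idea described above can be sketched in a few lines. This is a generic elitist evolutionary loop over latent vectors, not the actual procedure of EvoSearch or Video-T1: `toy_reward`, the latent size, and every hyperparameter are hypothetical stand-ins for a learned video reward model and the denoiser's latent states.

```python
import random


def toy_reward(latent):
    """Hypothetical stand-in for a video reward model: higher is better."""
    target = [0.5, -0.2, 0.1, 0.9]
    return -sum((x - t) ** 2 for x, t in zip(latent, target))


def evolutionary_search(reward_fn, population, generations, elite_k, sigma, rng):
    """Keep the top-k candidates each generation and refill by mutating them."""
    population = [list(p) for p in population]  # copy so the caller's list is untouched
    for _ in range(generations):
        ranked = sorted(population, key=reward_fn, reverse=True)
        elites = ranked[:elite_k]
        children = []
        while len(elites) + len(children) < len(population):
            parent = rng.choice(elites)
            # Gaussian mutation explores the neighbourhood of a good candidate.
            children.append([x + rng.gauss(0.0, sigma) for x in parent])
        population = elites + children
    return max(population, key=reward_fn)


rng = random.Random(0)
initial = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
best = evolutionary_search(toy_reward, initial, generations=20, elite_k=2, sigma=0.1, rng=rng)
```

Because elites are carried over unchanged, the best reward never decreases from one generation to the next. The price is one reward evaluation per candidate per generation, which is the inference compute being traded for quality.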

Large-Scale Foundation Models & Architectures

5 papers

Scaling video generation models to tens of billions of parameters with novel architectures including flow matching transformers, autoregressive token-based approaches, and unified multimodal systems.

Movie Gen Amazon Nova Reel DiCoDe Yume

Evaluation Benchmarks & Quality Assessment

2 papers

New benchmarks and evaluation methodologies for assessing video generation quality, including object state changes, temporal hallucinations, and physical plausibility.

OSCBench CounterVid

💡 Key Insights

💡 GRPO-based RL post-training improves video motion quality by up to 181% over prior RL methods

💡 Adversarial post-training enables one-step video generation that is preferred over a 25-step diffusion baseline for visual fidelity

💡 Test-time evolutionary search lets a 1.3B model compete with a 10× larger 14B model without retraining

💡 Physical plausibility remains a fundamental gap despite achieving high visual fidelity scores

💡 Multi-scene narrative video now spans 20+ coherent shots and 3 minutes of consistent content

📅 Timeline

Research has converged on GRPO-based reinforcement learning as the dominant post-training paradigm while branching into two complementary directions: ultra-fast generation via adversarial one-step methods and ultra-high-quality generation via test-time scaling. Increasingly, video generation is being applied beyond content creation to embodied AI, world modeling, and automated professional production.

2023-05 to 2023-11 Early controllable generation and compositional planning foundations

(Control-A-Video, 2023) introduced motion-adaptive noise priors and spatio-temporal reward feedback for controllable text-to-video generation
(AnimateDiff, 2023) demonstrated plug-and-play motion modules that animate any personalized text-to-image model without model-specific tuning
HiP (Compositional Foundation Models for Hierarchical Planning, 2023) pioneered iterative refinement across language, video, and action foundation models for long-horizon embodied planning
Drive-WM (Driving into the Future, 2023) introduced the first multiview driving world model compatible with end-to-end planners, achieving 3.65 FID on nuScenes

2024-01 to 2024-12 Reward-guided alignment, foundation model scaling, and distillation breakthroughs

T2V-Turbo (T2V-Turbo, 2024) broke the consistency model quality bottleneck by integrating mixed spatial-temporal reward feedback during distillation, achieving >10× inference acceleration
VADER (Video Diffusion Alignment via Reward Gradients, 2024) pioneered backpropagating differentiable reward gradients through video denoising on consumer hardware (16GB VRAM)
(Movie Gen, 2024) scaled flow matching transformers to 30B parameters for 1080p HD video with integrated audio, editing, and personalization capabilities
T2V-Turbo-v2 (T2V-Turbo-v2, 2024) achieved a state-of-the-art VBench Total Score of 85.13 by combining offline motion guidance with multi-reward consistency distillation, surpassing Gen-3 and Kling
(DOLLAR, 2024) combined variational score and consistency distillation with latent reward models for 278.6× inference speedup
(Large Motion Model, 2024) consolidated 16 motion datasets into the MotionVerse benchmark with 320k sequences for unified multi-task motion generation

🔀 Shift from supervised training to reward-based post-training — models began using RL and differentiable rewards to align video outputs with human preferences, moving beyond simple likelihood optimization.

2025-01 to 2025-12 GRPO revolution, one-step generation, test-time scaling, and multi-scene narratives

(Diffusion Adversarial Post-Training, 2025) achieved one-step 1280×720 video generation at real-time speed by training directly against real data with a 16B-parameter GAN
(DanceGRPO, 2025) established the foundational GRPO framework for visual generation, outperforming DDPO/DPOK by up to 181% on motion quality
(LCT, 2025) expanded single-shot models to generate coherent 20-shot, 3-minute narrative videos via attention window expansion
(Autoregressive Adversarial Post-Training, 2025) enabled real-time 24fps interactive streaming of 1-minute consistent videos on a single H100
(RewardDance, 2025) scaled generative reward models to 26B parameters with Chain-of-Thought reasoning, drastically reducing reward hacking
(EvoSearch, 2025) introduced test-time evolutionary search where a 1.3B model matches performance of the 10× larger 14B model
(Video Alchemist, 2025) achieved +23.2% subject similarity improvement in open-set multi-subject video personalization without test-time optimization
Seedance 1.5 (Seedance 1.5 Pro, 2025) demonstrated native joint audio-visual generation with RLHF, achieving >10× inference speedup via multi-stage distillation

🔀 Group Relative Policy Optimization (GRPO) became the dominant RL paradigm for video generation post-training, while test-time compute scaling emerged as a complementary training-free approach to improve quality.

2026-01 to 2026-03 Autonomous embodied learning, automated content production, and evaluation maturation

(PlayWorld, 2026) demonstrated autonomous robot play for world model training, improving real-world policy success rates by 65%
(COMIC, 2026) achieved fully automated sketch comedy production using multi-agent iterative competition with engagement-calibrated critics
(FlashMotion, 2026) solved trajectory control in few-step distilled generators via three-stage hybrid adapter tuning (FID 14.35 in 4 steps)
(OSCBench, 2026) introduced systematic evaluation of object state change understanding across 1,120 prompts and 6 SOTA models
(MaDiS, 2026) achieved state-of-the-art sign language generation with masked diffusion, reducing inference latency by ~30% over autoregressive baselines

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Reward-Aligned RL Post-Training	Group Relative Policy Optimization (GRPO) uses group-based advantage estimation from reward models to stabilize RL training of visual generators without a separate value network.	DanceGRPO improves over DDPO/DPOK by +181% on VideoAlign motion quality benchmarks; T2V-Turbo-v2 achieves 85.13 VBench Total Score, surpassing Gen-3 (82.32) and Kling (81.85)	DanceGRPO (2025), RewardDance (2025), Video Diffusion Alignment via Reward... (2024), T2V-Turbo-v2 (2024), PhysCorr (2025)
Adversarial Post-Training & Fast Distillation	Adversarial Post-Training (APT) trains a generator directly against real data using a GAN objective, abandoning teacher-student distillation entirely for one-step video generation.	APT surpasses 25-step diffusion baseline by +32.3% in visual fidelity preference for one-step generation; AAPT achieves real-time 24fps at 736×416 on a single H100 GPU	Diffusion Adversarial Post-Training for One-Step... (2025), Autoregressive Adversarial Post-Training for Real-Time... (2025), DOLLAR (2024), FlashMotion (2026)
Test-Time Compute Scaling for Video	Reformulates video denoising as a search problem where evolutionary algorithms mutate and evolve latent states to discover high-quality generation paths.	EvoSearch with Wan 1.3B achieves competitive performance with the 10× larger Wan 14B model; Video-T1's Tree-of-Frames reduces scaling cost compared to random linear search	Scaling Image and Video Generation... (2025), Video-T1 (2025)
Long Context & Multi-Scene Narrative Generation	Long Context Tuning (LCT) expands the attention window of single-shot models to process all shots simultaneously with interleaved 3D positional embeddings and asynchronous diffusion timesteps.	LCT generates coherent 20-shot, 3-minute videos from single-shot models; InfLVG extends generation length by 9× over standard autoregressive baselines while maintaining consistency	Long Context Tuning (2025), Automated Movie Generation via Multi-Agent... (2025), InfLVG (2025), COMIC (2026)
Video World Models for Embodied Intelligence	World Model-based Policy Optimization (WMPO) replaces real-world RL rollouts with imagined video trajectories from a pixel-space diffusion backbone, enabling safe on-policy learning.	PlayWorld improves real-world robotic policy success rates by 65% over pre-trained policies; Drive-WM achieves 3.65 FID on nuScenes, outperforming DriveDreamer (5.21 FID)	WMPO (2025), PlayWorld (2026), Driving into the Future: Multiview... (2023), Reinforcement Learning with Inverse Rewards... (2025)
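
The group-based advantage estimation named in the first row can be illustrated with a short sketch. It assumes the common GRPO normalization, in which each sample's reward is standardized against the other samples generated for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from DanceGRPO.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled videos.

    The group mean plays the role of the baseline, so no separate
    value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Toy reward-model scores for four videos sampled from the same prompt.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
```

Samples scoring above the group mean get a positive advantage and are reinforced; those below are suppressed. Every prompt needs several generated videos before any advantage can be computed, which is the cost listed under Known Limitations.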

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
VBench	Total Score (percentage, higher is better)	85.13%	T2V-Turbo-v2 (2024)
VideoAlign Motion Quality	Motion Quality Score (relative improvement over baselines)	+181% over baselines	DanceGRPO (2025)
nuScenes Video Generation	FID (Fréchet Inception Distance, lower is better)	3.65 FID	Driving into the Future: Multiview... (2023)
One-Step Video Generation Preference	Human Preference Rate (percentage preferring one-step over multi-step)	+32.3% preference over 25-step baseline	Diffusion Adversarial Post-Training for One-Step... (2025)
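
For reference, FID as reported in the nuScenes row is conventionally computed by fitting Gaussians to Inception features of the real and generated frames and taking the Fréchet distance between them. The symbols below are the standard ones rather than notation from the cited papers: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the feature mean and covariance of the real and generated sets.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Identical feature distributions give a value of 0, which is why lower is better.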

⚠️ Known Limitations (4)

Physical plausibility violations: Generated videos frequently break fundamental physics laws (gravity, object permanence, fluid dynamics) despite impressive visual quality, limiting deployment in simulation and robotics domains (affects: Reward-Aligned RL Post-Training, Adversarial Post-Training & Fast Distillation)
Potential fix: Physics-specific reward models (PhysicsRM) and synthetic physics datasets for targeted fine-tuning; PISA shows as few as 5,000 synthetic samples can teach pre-trained models specific physical behaviors like gravity
Reward hacking and Goodhart's Law: Sustained RL optimization causes reward models to lose fidelity as quality proxies, with models exploiting shortcuts (improving one metric at the expense of others) and reward scores saturating within a few hundred training steps (affects: Reward-Aligned RL Post-Training)
Potential fix: TaRoS dynamically rebalances reward components based on intra-group discriminative ability; RewardDance scales reward models to 26B parameters with CoT reasoning to maintain signal quality and reduce hacking
Identity blending in multi-subject personalization: When multiple reference identities are provided, attributes from different subjects frequently merge into composite characters, especially for same-gender or same-race pairs, undermining practical personalization (affects: Long Context & Multi-Scene Narrative Generation, Reward-Aligned RL Post-Training)
Potential fix: Anchored prompts with concept-specific embeddings explicitly link each reference image to its text entity (Movie Weaver); identity-preserving reward models trained on human preference data provide feedback for GRPO-based alignment (Identity-GRPO)
Computational cost of RL post-training: GRPO-based methods require generating multiple candidate videos per prompt for advantage estimation, creating significant GPU memory and time overhead that limits scalability to larger prompt sets and higher resolutions (affects: Reward-Aligned RL Post-Training, Test-Time Compute Scaling for Video)
Potential fix: Bayesian prior-guided optimization (BPGO) filters noisy reward signals to converge faster with fewer samples; trajectory alignment with memory banks (TAGRPO) avoids expensive re-generation by reusing past high/low reward trajectories

📚 View major papers in this topic (10)

DanceGRPO: Unleashing GRPO on Visual Generation (2025-05) 9
Diffusion Adversarial Post-Training for One-Step Video Generation (2025-01) 9
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation (2025-06) 9
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training (2024-10) 9
Movie Gen: A Cast of Media Foundation Models (2024-10) 9
RewardDance: Reward Scaling in Visual Generation (2025-09) 9
Scaling Image and Video Generation via Test-Time Evolutionary Search (2025-05) 8
Long Context Tuning (2025-03) 8
PlayWorld: Learning Robot World Models from Autonomous Play (2026-03) 8
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning (2023-07) 8

💡 Within the same paradigm, another important research direction focuses on Image Editing.

🔄

Image Editing

What: Image editing research develops methods to modify existing images based on text instructions, reference images, or user intent while preserving unedited content and identity.

Why: Enabling intuitive visual content manipulation empowers creators and everyday users to realize creative visions without tedious manual pixel-level work.

Baseline: Standard diffusion models generate images from text prompts but lack precise control over local edits, identity preservation, and complex multi-attribute modifications.

Preserving unedited regions and subject identity while applying targeted semantic modifications to specific areas
Handling complex multi-object instructions with correct attribute binding, spatial reasoning, and style consistency
Building reliable automated reward signals to train and evaluate editing quality at scale

🧪 Running Example

❓ Edit this photo of a golden retriever on a beach: change the dog's pose to sitting, replace the background with a snowy mountain, and preserve the dog's exact appearance.

Baseline: A standard diffusion model would regenerate the entire image from the text description, losing the dog's specific identity, fur pattern, and producing inconsistent edits across regions.

Challenge: This example requires three simultaneous capabilities: (1) preserving the dog's identity (personalization), (2) changing pose and background independently (disentangled editing), and (3) ensuring spatial coherence between foreground subject and new background (compositional reasoning).

✅ Reward-Conditioned Editing: Multi-Reward Condition (MRC) decomposes quality into instruction following, detail preservation, and generation quality scores, guiding the model to maximize all three—ensuring the dog's identity is preserved while the pose and background change accurately.

✅ Training-Free Personalization: JeDi's coupled self-attention learns the dog's identity from a single reference image without fine-tuning, enabling identity-consistent generation in the new snowy mountain context.

✅ Reasoning-Guided Visual Generation: RPG uses an MLLM to decompose the complex instruction into sub-prompts ('sitting golden retriever' for foreground, 'snowy mountain' for background) and generates each region independently via Complementary Regional Diffusion.

✅ MM-DiT Attention Manipulation: HeadRouter identifies attention heads specialized for pose vs. background semantics in MM-DiT and routes the edit signal to the relevant heads, enabling precise local modifications without affecting other regions.

✅ Parameter-Efficient Diffusion Personalization: SVDiff fine-tunes only the singular values of weight matrices on the dog's photo in ~1.7MB, capturing identity efficiently while PALP's score distillation ensures the 'snowy mountain' style is faithfully rendered.

📈 Overall Progress

Image editing research has evolved from parameter-heavy per-subject fine-tuning (2023) through training-free personalization and LLM-driven compositional reasoning (2024) to unified multimodal frameworks that jointly handle generation, editing, and composition with RL-optimized reward signals (2025–2026). A key paradigm shift was the integration of reward models as first-class components, enabling automated quality evaluation that rivals human experts. The field has converged toward systems where reasoning, generation, and self-correction operate as coordinated agents rather than isolated pipelines.

📂 Sub-topics

Reward-Driven Image Editing

4 papers

Methods that use reward models, reinforcement learning, and preference learning to improve instruction following, detail preservation, and generation quality in image editing systems.

Multi-Reward Condition EditScore OneReward Joint Reward Modeling

Subject & Style Personalization

9 papers

Techniques for generating and editing images that preserve specific subject identity, artistic style, or visual attributes from reference images, including both fine-tuning-based and training-free approaches.

SVDiff JeDi HiPer RB-Modulation

Reasoning-Guided & Interactive Editing

4 papers

Approaches that leverage multimodal LLM reasoning, chain-of-thought planning, and interactive user interfaces to handle complex compositional edits and improve prompt engineering for image generation.

RPG GoT SIDiffAgent PromptCharm

Architecture Adaptation & Efficient Editing

5 papers

Research on adapting modern diffusion transformer architectures (MM-DiT) for editing tasks, accelerating inference through sparse parameterization, and building unified generation-editing frameworks.

HeadRouter Input Projection Editing Sparse-LaViDa Seedream 4.0

Image Restoration & Enhancement

4 papers

Diffusion-based methods for super-resolution, denoising, medical image reconstruction, and coarse-to-fine visual refinement that restore or enhance image quality under degradation.

QUSR DECADE Weighted h-Transform Segment-First Inpainting

Adversarial Robustness of Diffusion Models

1 paper

Research on understanding and exploiting vulnerabilities in text-to-image diffusion models through multi-modal adversarial attacks that manipulate generated content.

MMP-Attack

💡 Key Insights

💡 Decomposed reward signals dramatically improve instruction following and detail preservation in editing

💡 Training-free personalization now matches or exceeds fine-tuning-based methods in identity fidelity

💡 LLM chain-of-thought reasoning unlocks compositional generation that direct text-to-image mapping cannot achieve

💡 MM-DiT attention heads naturally specialize for different semantics, enabling targeted editing without retraining

💡 Unified generation-editing frameworks outperform separate specialized models on both tasks simultaneously

📅 Timeline

Research has progressively moved from isolated editing capabilities toward unified, reward-driven, and reasoning-guided systems. Early work focused on efficient personalization; mid-period work introduced compositional planning via LLMs; recent work emphasizes self-improving agentic systems with internalized reward models that enable continuous quality improvement.

2023-03 to 2023-05 Foundations of efficient personalization and style discovery in diffusion models

(SVDiff, 2023) introduced spectral shift fine-tuning, reducing personalization checkpoints from 3.66GB to 1.7MB while improving multi-subject disentanglement
HiPer (Highly Personalized Text Embedding for..., 2023) demonstrated that decomposing text embeddings into a semantic head and a personalized tail enables single-image personalization in 3 minutes
(ProSpect, 2023) discovered that diffusion denoising stages correspond to visual attributes in frequency order (layout → content → style)
Null-text cartoonization (Null-text Guidance is Secretly a..., 2023) revealed that perturbing the null-text branch in Classifier-Free Guidance produces cartoon stylization without any training
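
For context, classifier-free guidance combines a conditional and an unconditional (null-text) noise prediction at every denoising step. The sketch below shows only that standard combination on plain lists. The function name and toy numbers are hypothetical, and the specific perturbation used for cartoonization is not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the null-text
    prediction toward the text-conditioned prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.30, 0.05]  # toy prediction with the empty (null-text) prompt
eps_cond = [0.20, -0.10, 0.00]    # toy prediction with the text prompt

# A scale of 0.0 returns the null-text prediction; 1.0 returns the
# conditional one (up to floating-point rounding); larger values extrapolate.
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5)
```

Because the null-text prediction enters every step, changing what is fed to that branch shifts the whole guided trajectory, which is consistent with the training-free stylization result above.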

2024-01 to 2024-11 Maturation of personalization methods and emergence of reasoning-guided and reward-based editing

(Mastering Text-to-Image Diffusion, 2024) pioneered using MLLM chain-of-thought reasoning as a global planner for compositional generation with Complementary Regional Diffusion
(Joint-Image, 2024) eliminated fine-tuning entirely by learning joint image distributions with coupled self-attention, outperforming even DreamBooth
(Multi-Reward, 2024) introduced quality-aware conditioning that decomposes reward into instruction following, detail preservation, and generation quality
(HeadRouter, 2024) discovered semantic specialization of attention heads in MM-DiTs, enabling training-free text-guided editing on next-generation architectures

🔀 Shift from single-concept fine-tuning toward training-free personalization and LLM-driven compositional planning for complex multi-object edits.

2025-01 to 2025-12 Unified generation-editing frameworks, reward-driven RL training, and architecture-level efficiency

Seedream 4.0 (Seedream 4.0, 2025) unified T2I synthesis, editing, and multi-image composition in a single framework, ranking 1st on Artificial Analysis Arena and outperforming GPT-Image-1
(EditScore, 2025) established a rigorous reward benchmark and fine-tuned VLM-based reward models that surpass GPT-4o and GPT-5, enabling stable online RL for editing
(Generation Chain-of-Thought, 2025) introduced explicit language reasoning before pixel generation with a Semantic-Spatial Guidance Module for unified generation and editing
(Sparse-LaViDa, 2025) achieved 2.83× speedup on editing tasks while improving accuracy through sparse token processing with step-causal masking and KV-caching

🔀 Transition from separate generation and editing pipelines to unified multimodal systems that jointly handle T2I, editing, and composition within a single model.

2026-02 to 2026-03 Advanced reward internalization, self-improving agents, and quality-aware restoration

Joint Reward Modeling (Internalizing Chain-of-Thought for Efficient Visual..., 2026) introduced Latent CoT that internalizes generative reasoning into efficient discriminative scoring, achieving 85.1% on EditReward-Bench and surpassing GPT-5 by 9.6%
(Self-Improving, 2026) proposed Theory-of-Mind inspired self-improving agents with experience-driven memory, improving GenAIBench VQA Score by +8.73% over prior agentic systems
(Quality-Aware, 2026) leveraged VLM-generated quality descriptions and pixel-wise uncertainty maps to achieve state-of-the-art restoration, reducing FID by 16.74 on DRealSR

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Reward-Conditioned Editing	Decompose editing quality into explicit reward dimensions (instruction following, detail preservation, generation quality) and condition or optimize the editor against them.	EditScore-72B achieves 86.36% accuracy on EditReward-Bench, surpassing GPT-4o (84.41%) and GPT-5 (85.29%); MRC improves InsPix2Pix by +9.4% Instruction Following on Real-Edit.	EditScore (2025), Joint Reward Modeling (2026), Multi-Reward (2024), OneReward (2025)
Training-Free Personalization	Inject reference image features into the diffusion process at inference time through attention manipulation or joint distribution modeling, eliminating the need for per-subject training.	JeDi outperforms fine-tuning-based DreamBooth in CLIP-I and DINO subject fidelity scores on the DreamBooth dataset while requiring zero test-time optimization.	JeDi (2024), RB-Modulation (2024), FreeTuner (2024)
Reasoning-Guided Visual Generation	Use multimodal LLMs as planners that reason about spatial relationships and semantic structure before delegating sub-regions or editing steps to diffusion models.	SIDiffAgent achieves +8.73% VQA Score on GenAIBench over T2I-Copilot and +5.36% over proprietary Imagen 3.	Mastering Text-to-Image Diffusion (2024), GoT (2025), SIDiffAgent (2026)
MM-DiT Attention Manipulation	Decompose MM-DiT's joint attention into functional blocks and selectively modify image input projections or route text guidance to semantically sensitive heads for localized edits.	Input projection editing achieves robust editing across 5 MM-DiT variants (SD3, SD3.5, Flux.1) while maintaining inference speed within 2% of standard generation (15.2s vs 14.9s).	HeadRouter (2024), Exploring Multimodal Diffusion Transformers for... (2025)
Parameter-Efficient Diffusion Personalization	Fine-tune minimal parameter subsets (spectral shifts, embedding tails, prompt spectra, or decoupled identity modules) to capture subject identity without full model retraining.	SVDiff reduces checkpoint size to ~1.7MB per subject (vs. 3.66GB for DreamBooth, a ~2,200× reduction) while achieving 60.9% user preference over full-weight fine-tuning for multi-subject generation.	SVDiff (2023), Highly Personalized Text Embedding for... (2023), ProSpect (2023), Infinite-ID (2024), PALP (2024)
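
The size reduction in the last row follows from what is being trained. A minimal sketch, assuming a hypothetical set of layer shapes (not the real Stable Diffusion layout): fine-tuning only singular values trains min(m, n) numbers per m×n weight matrix instead of m·n. The checkpoint ratio uses the sizes quoted in the table and assumes 1 GB = 1024 MB.

```python
def full_params(shape):
    """Parameters updated by full-weight fine-tuning of an m x n matrix."""
    m, n = shape
    return m * n


def spectral_params(shape):
    """An m x n matrix has min(m, n) singular values, one shift per value."""
    m, n = shape
    return min(m, n)


# Hypothetical layer shapes, for illustration only.
layer_shapes = [(320, 320), (640, 320), (1280, 640), (1280, 1280)]

full_total = sum(full_params(s) for s in layer_shapes)
spectral_total = sum(spectral_params(s) for s in layer_shapes)

# Checkpoint sizes quoted in the table: 3.66 GB (DreamBooth) vs ~1.7 MB (SVDiff).
checkpoint_ratio = 3.66 * 1024 / 1.7
```

With these toy shapes the trainable count drops from 2,764,800 to 2,560, and the quoted checkpoint sizes give a ratio of roughly 2,200×, in line with the table.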

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
EditReward-Bench	Accuracy (%)	86.36%	EditScore (2025)
GEdit-Bench	Editing Success Rate (%)	+14.6% improvement over base OmniGen2	EditScore (2025)
GenAIBench	VQA Score	+8.73% over T2I-Copilot	SIDiffAgent (2026)
DRealSR	FID (Fréchet Inception Distance, lower is better)	State-of-the-art across all metrics	QUSR (2026)
Artificial Analysis Arena	Arena Ranking	Ranked 1st in both single-image editing and T2I tracks	Seedream 4.0 (2025)

⚠️ Known Limitations (4)

Reward model accuracy ceiling — even the best reward models (86.36%) disagree with human experts ~14% of the time, potentially misguiding RL optimization toward non-human-aligned outputs (affects: Reward-Conditioned Editing)
Potential fix: Joint training of discriminative and generative reward objectives (Latent CoT) improves reasoning capabilities; self-ensembling reduces variance in reward estimates
Identity-text entanglement — most methods face a trade-off between preserving subject identity fidelity and adhering to complex text prompts describing new contexts or styles (affects: Training-Free Personalization, Parameter-Efficient Diffusion Personalization)
Potential fix: Explicit ID-semantics decoupling via separate attention modules (Infinite-ID) or score distillation to prevent prompt forgetting (PALP)
Computational cost of reasoning-guided methods — LLM-based planning adds significant latency (multiple LLM inference calls per image) and may not scale to real-time interactive editing applications (affects: Reasoning-Guided Visual Generation)
Potential fix: Experience-driven memory caching (SIDiffAgent) reduces redundant planning; adversarial acceleration (Seedream 4.0) enables 1.4s generation at 2K resolution
Limited generalization of attention manipulation to new architectures — methods designed for specific MM-DiT variants may not transfer to future architectures with different attention patterns (affects: MM-DiT Attention Manipulation)
Potential fix: Instance-adaptive routing (HeadRouter) and block selection strategies generalize across 5+ MM-DiT variants without model-specific tuning, suggesting architecture-agnostic principles may exist
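
The self-ensembling fix listed for the reward-model accuracy ceiling can be illustrated with a toy simulation. This is a minimal sketch, not EditScore's implementation: `noisy_reward` is a hypothetical stand-in for one stochastic reward-model call, and the Gaussian noise model, the true score of 0.7, and the noise level of 0.2 are all assumptions made for illustration.

```python
import random
import statistics


def noisy_reward(true_score: float, noise_std: float, rng: random.Random) -> float:
    """Hypothetical stand-in for a single stochastic reward-model call."""
    return true_score + rng.gauss(0.0, noise_std)


def ensembled_reward(true_score: float, noise_std: float, k: int,
                     rng: random.Random) -> float:
    """Average k independent reward samples for the same edit."""
    samples = [noisy_reward(true_score, noise_std, rng) for _ in range(k)]
    return statistics.fmean(samples)


def estimate_spread(k: int, trials: int = 2000, seed: int = 0) -> float:
    """Empirical standard deviation of the k-sample ensemble estimate."""
    rng = random.Random(seed)
    estimates = [ensembled_reward(0.7, 0.2, k, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for k in (1, 4, 8):
        print(f"k={k}: spread of reward estimate = {estimate_spread(k):.3f}")
```

With independent noise the spread shrinks roughly as 1/sqrt(k). Averaging does not remove a systematic bias shared by all samples, so ensembling addresses variance but not the underlying disagreement with human experts.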

💡 Within the same paradigm, another important research direction focuses on Unified Understanding and Generation.

🔍

Unified Understanding and Generation

What: Research on building single models that jointly understand and generate content across multiple modalities (text, image, audio) within a shared architecture.

Why: Separate understanding and generation models are inefficient, miss cross-modal synergies, and cannot leverage comprehension to improve generation quality.

Baseline: Pipeline approaches using separate specialized models for understanding (e.g., LLaVA for vision-language) and generation (e.g., Stable Diffusion for images).

Bridging the cognitive gap between understanding and generation within shared model parameters
Preventing task conflict and quality degradation when jointly training on multiple modalities
Maintaining generation quality and coherence in long interleaved multi-modal sequences

🧪 Running Example

❓ Given a reference photo of a red fox and the text 'The fox leaps across a frozen river under a glowing moon,' generate a storybook illustration that preserves the fox's appearance while matching the described scene.

Baseline: A pipeline approach would use a vision model to caption the fox photo and a separate text-to-image model to generate the scene, but the generated fox would lose its distinctive features and the scene composition would ignore visual context from the reference photo.

Challenge: This example requires understanding the reference image (fox appearance), reasoning about the text description (scene layout, moonlight), and generating a coherent image — all within one model. It also illustrates the cognitive gap: the model may 'understand' the fox but fail to translate that understanding into generation-friendly features.

✅ Autoregressive Multimodal Pretraining: Lumina-mGPT and Mogao use unified autoregressive architectures that encode the fox photo and text in a shared token sequence, generating image tokens conditioned on both inputs to preserve cross-modal coherence.

✅ Chain-of-Thought Image Generation: ImageGen-CoT generates an explicit reasoning chain ('The fox has red-orange fur with white chest markings; the scene needs blue moonlight reflecting on ice...') before producing the image, ensuring generation aligns with both the reference and the description.

✅ Endogenous Reprompting (SEER): SEER transforms the model's understanding of the fox photo into a self-aligned descriptor optimized for the internal generator, bridging the cognitive gap between comprehension and image synthesis using only 300 seed samples.

✅ Unified Diffusion-Based Generation: MMaDA treats both the text description and image output as discrete tokens under a shared diffusion process, with UniGRPO reinforcement learning optimizing for both semantic accuracy and visual quality.

📈 Overall Progress

The field has progressed from separate understanding and generation models to unified architectures that handle both within shared parameters. Key paradigm shifts include the move from modality-specific designs to modality-agnostic diffusion and autoregressive frameworks, and from direct generation to reasoning-then-generating approaches. Recent work addresses frontier challenges like long-horizon coherence and speech integration, pushing toward truly omni-modal systems.

📂 Sub-topics

Unified Model Architectures

5 papers

Architectural approaches for building single models that handle both understanding and generation, including autoregressive, diffusion-based, and flow-based designs with strategies to prevent task conflict in shared parameters.

Autoregressive Multimodal Pretraining, Unified Diffusion-Based Generation
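
One way to picture the "prevent task conflict in shared parameters" strategy is modality-aware routing: tokens stay in one interleaved sequence, but each token is transformed by the weights belonging to its own modality. The sketch below is a toy illustration under that assumption. The modality names, the 2×2 matrices, and the `routed_projection` helper are hypothetical and are not taken from any of the papers listed here.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
Token = Tuple[str, Vector]  # (modality tag, hidden state)


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def routed_projection(tokens: List[Token], experts: Dict[str, Matrix]) -> List[Vector]:
    """Apply each token's modality-specific weight matrix.

    The sequence order is preserved, so a shared attention step could still
    mix modalities; only the per-token projection parameters are separated.
    """
    return [matvec(experts[modality], hidden) for modality, hidden in tokens]


if __name__ == "__main__":
    # Hypothetical toy weights: one matrix per modality.
    experts = {
        "text": [[2.0, 0.0], [0.0, 2.0]],
        "image": [[-1.0, 0.0], [0.0, -1.0]],
    }
    sequence = [("text", [1.0, 2.0]), ("image", [3.0, 4.0])]
    print(routed_projection(sequence, experts))
```

Because gradients for text tokens only touch the text weights (and likewise for image tokens), updates for one task do not overwrite parameters the other task relies on, which is the intuition behind decoupled designs.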

Reasoning-Enhanced Generation

4 papers

Methods that introduce explicit reasoning steps — chain-of-thought, self-evaluation, reinforcement learning — before or during generation to improve instruction adherence, subject fidelity, and compositional accuracy.

Chain-of-Thought Image Generation, Reasoning-Enhanced Personalization

Interleaved and Omni-Modal Generation

2 papers

Research on generating long interleaved text-image sequences and integrating additional modalities such as speech, addressing quality collapse in extended generation and efficient multi-modal fusion.

Long-Horizon Context Curation, Speech-Centric Omni-Cognition

💡 Key Insights

💡 Explicit reasoning before generation improves instruction adherence by 89–160%

💡 Visual history actively pollutes long-horizon generation after ~20 discrete images

💡 Decoupled routing prevents understanding-generation task conflict in unified models

💡 Diffusion models can match autoregressive LLMs on reasoning with RL post-training

💡 Complementary text-image reasoning outperforms redundant cross-modal descriptions

📅 Timeline

Research has evolved from foundational joint modeling (2023) through architectural innovations for multi-modal fusion (2024–2025) to reasoning-enhanced and reliability-focused generation (2025–2026), with increasing emphasis on explicit chain-of-thought and reinforcement learning to bridge the understanding-generation gap.

2023-06 to 2024-08 Foundational multi-modal generative architectures

(Multi-Modal Latent Diffusion, 2023) pioneered decoupled encoding with shared diffusion modeling, resolving the coherence-quality tradeoff that plagued multi-modal VAEs by achieving 85.22% joint coherence
(Lumina-mGPT, 2024) demonstrated that pure autoregressive decoder-only models can match diffusion model quality for photorealistic generation through Flexible Progressive Supervised Fine-tuning, training a versatile 7B model in just 7 days

🔀 Shift from modality-specific models to unified architectures that jointly model multiple modalities under shared probabilistic frameworks.

2024-12 to 2025-06 Architectural scaling and omni-modal expansion

(OmniFlow, 2024) extended rectified flows to any-to-any generation across text, image, and audio with novel multi-modal guidance and model merging for stable training
(Lyra, 2024) integrated speech into multimodal LLMs through latent cross-modality regularization and dynamic token reduction, enabling multi-hour speech processing
(Mogao, 2025) introduced decoupled QKV/FFN routing and Efficient Complete Teacher Forcing for causal interleaved multi-modal generation, achieving 83.3% on MME perception
(MMaDA, 2025) achieved the first fully modality-agnostic diffusion model with UniGRPO reinforcement learning, outperforming autoregressive LLMs on reasoning benchmarks while excelling at image generation

2025-08 to 2026-03 Reasoning-enhanced generation and long-horizon reliability

(MM-R1, 2025) applied cross-modal chain-of-thought with GRPO reinforcement learning for zero-shot personalized generation without subject-specific fine-tuning
(ImageGen-CoT, 2025) introduced structured reasoning before image generation with hybrid scaling, improving CoBSAT scores by 89% and DreamBench++ by 114%
(ThinkMorph, 2025) established complementary interleaved reasoning where text and image thoughts advance problem-solving synergistically, enabling a 7B model to surpass 38B models on spatial reasoning
(SEER, 2026) developed self-evolving cognitive alignment using only 300 seed samples, showing that optimizing reasoning can outperform optimizing pixel-level execution
(UniLongGen, 2026) identified the event bottleneck in long-horizon generation — quality collapses after ~20 visual events — and proposed training-free layer-split visibility to sustain generation fidelity

🔀 Shift from direct generation to reasoning-then-generating paradigms where models explicitly plan and reason before producing visual output.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Autoregressive Multimodal Pretraining	Initialize from strong multimodal bases and use decoupled routing or progressive fine-tuning to prevent understanding-generation task conflict.	Mogao achieves 83.3% on MME perception, surpassing Emu2 by +5.0pp (78.3%) and Mantis-8B by +2.7pp (80.6%); Lumina-mGPT matches SD3 and DALL-E 3 image quality using a pure autoregressive approach.	Lumina-mGPT (2024), Mogao (2025)
Unified Diffusion-Based Generation	Model all modalities as discrete tokens or latent vectors under a unified denoising process with a modality-agnostic architecture.	Multi-Modal Latent Diffusion (MLD) achieves 85.22% joint coherence on MNIST-SVHN, improving over MVTCAE by +36pp; OmniFlow achieves 1.79 FAD (Fréchet Audio Distance, lower is better) for audio, improving over the AudioMAE baseline of 2.03; MMaDA surpasses LLaMA-3-7B on GSM8K and MATH reasoning despite being a diffusion model.	Multi-Modal Latent Diffusion (2023), OmniFlow (2024), MMaDA (2025)
Chain-of-Thought Image Generation	Generate structured textual reasoning (chain-of-thought) prior to image synthesis, with complementary text-image thoughts advancing reasoning synergistically.	ImageGen-CoT improves SEED-X by +89% on CoBSAT and +114% on DreamBench++, with best reported scores of 0.909 on CoBSAT (baseline 0.349) and 0.543 on DreamBench++ (baseline 0.188); ThinkMorph achieves +85.84% accuracy on Spatial Navigation (VSP) and surpasses InternVL3.5-38B on SAT reasoning (52.67% vs 49.33%); SEER outperforms Emu3 and Janus-Pro in instruction adherence using only 300 seed samples.	ImageGen-CoT (2025), MM-R1 (2025), ThinkMorph (2025), Endogenous Reprompting (2026)
Long-Horizon Context Curation	Generation fails based on discrete visual event count (~20 images), not token length; layer-split attention separates text grounding from image synthesis.	Demonstrates that 150k text tokens maintain high fidelity while 150k image tokens (~30 images) cause total collapse; UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency.	How Long Can Unified Multimodal... (2026)
Speech-Centric Omni-Cognition	Align speech tokens to text transcript embeddings in latent space and dynamically prune redundant tokens via attention-based similarity.	Achieves state-of-the-art across vision-language, vision-speech, and speech-language benchmarks compared to other omni-methods; compresses long speech to 300 tokens per segment for multi-hour processing.	Lyra (2024)
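
Several entries in this table mix relative gains (percent) with absolute scores, so it can help to compute both from the quoted endpoints. The helper below is plain arithmetic; the function names are illustrative, and the inputs are the numbers quoted in the table rows above.

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference between two scores, in the scores' own units."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (0.5 means +50%)."""
    return (new - old) / old


if __name__ == "__main__":
    rows = {
        "CoBSAT, 0.349 -> 0.909": (0.909, 0.349),
        "DreamBench++, 0.188 -> 0.543": (0.543, 0.188),
        "MME perception, Emu2 78.3 -> Mogao 83.3": (83.3, 78.3),
    }
    for name, (new, old) in rows.items():
        print(f"{name}: absolute {absolute_gain(new, old):+.3f}, "
              f"relative {relative_gain(new, old):+.1%}")
```

The endpoint-to-endpoint CoBSAT gain works out to roughly 160%, which lines up with the upper end of the 89–160% range quoted in the key insights. The +89% and +114% figures are smaller than the endpoint-to-endpoint gains, so they presumably describe a different configuration than the best scores.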

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MME Perception	Accuracy (%)	83.3%	Mogao (2025)
CoBSAT	Accuracy Score (0–1)	0.909	ImageGen-CoT (2025)
DreamBench++	Score (0–1)	0.543	ImageGen-CoT (2025)
MNIST-SVHN Joint Coherence	Joint Coherence (%)	85.22%	Multi-Modal Latent Diffusion (2023)
SAT Spatial Reasoning	Accuracy (%)	52.67%	ThinkMorph (2025)

⚠️ Known Limitations (4)

Quality collapse in long interleaved sequences: accumulated visual tokens hijack attention, limiting practical applications like storybook or document generation beyond ~20 images (affects: Autoregressive Multimodal Pretraining, Chain-of-Thought Image Generation)
Potential fix: Layer-split visibility policies and context curation strategies that separate text grounding from image synthesis at different transformer layers (UniLongGen)
Cognitive gap between understanding and generation: models comprehend visual instructions but fail to translate that understanding into generator-friendly representations, causing instruction-following failures (affects: Autoregressive Multimodal Pretraining, Unified Diffusion-Based Generation)
Potential fix: Self-evolving reprompting mechanisms (SEER) that train models to generate self-aligned descriptors, optimizing reasoning prompts rather than pixel-level execution
Task conflict in shared parameters: joint training on understanding and generation degrades performance on one or both tasks due to gradient interference and competing optimization objectives (affects: Autoregressive Multimodal Pretraining, Unified Diffusion-Based Generation)
Potential fix: Decoupled QKV/FFN routing for separate task pathways (Mogao) or fully modality-agnostic architectures with task-specific RL fine-tuning (MMaDA)
Limited modality coverage: most unified models handle only text and images, with speech, audio, and video integration remaining underexplored and computationally expensive (affects: Autoregressive Multimodal Pretraining, Chain-of-Thought Image Generation)
Potential fix: Latent cross-modality regularization for speech alignment (Lyra) and multi-modal rectified flows for any-to-any generation (OmniFlow) that extend joint modeling to speech and audio

📚 View major papers in this topic (10)

Multi-Modal Latent Diffusion (2023-06) 8
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (2024-08) 8
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (2024-12) 8
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025-05) 8
MMaDA: Multimodal Large Diffusion Language Models (2025-06) 8
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning (2025-03) 8
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning (2025-10) 8
Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models (2026-01) 8
How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation (2026-03) 8
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (2024-12) 7

💡 Moving to the next paradigm, we turn to Other MM Topics.

📦

GUI and Web Agents

What: Research on autonomous agents that perceive graphical user interfaces via vision-language models and execute multi-step tasks through mouse, keyboard, and touch interactions.

Why: Automating GUI-based workflows can dramatically reduce human effort on repetitive digital tasks across mobile, desktop, and web environments.

Baseline: Supervised fine-tuning on static human-annotated trajectories, where a VLM predicts the next action from a screenshot and text instruction.

Binary reward signals provide no gradient for near-miss clicks, making precise spatial grounding difficult to learn
Static offline training fails to generalize to dynamic, stochastic real-world interfaces that change across apps and updates
Long-horizon tasks compound errors across dozens of steps, where a single misclick can derail the entire workflow

🧪 Running Example

❓ Book a round-trip flight from San Francisco to New York for next Friday under $300 on a travel website.

Baseline: A supervised fine-tuned agent processes the screenshot and predicts clicks sequentially. It may click near the date picker but miss the exact target (binary reward gives no feedback on 'close' misses), select the wrong date because it overfits to training UI layouts, and fail to recover after navigating to a dead-end page.

Challenge: This task requires precise grounding (clicking small date cells and dropdown menus), long-horizon planning (search → filter → select → confirm → pay), adaptation to unseen airline website layouts, and verification that the booking actually succeeded rather than just appearing to.

✅ Continuous Spatial Reward RL: GUI-G2's Gaussian reward gives partial credit for clicks near the date picker target, enabling the agent to gradually learn precise localization instead of random trial-and-error.

✅ Online Agentic RL with GRPO: MobileRL-style training places the agent in a live interactive environment with adaptive difficulty filtering, so the agent learns to handle dynamic page loads, pop-ups, and layout changes that static training cannot capture.

✅ Agentic Verification and Pre-operative Critics: GUI-Critic-R1 evaluates the proposed 'Confirm Payment' action before execution, catching a potential mistake (wrong date selected) and suggesting a corrective action before an irreversible purchase.

✅ Coordinate-Free Visual Grounding: GUI-Actor uses attention-based localization to directly identify the 'Search Flights' button from visual features, avoiding fragile numeric coordinate prediction that fails on different screen resolutions.

✅ Autonomous Data Synthesis: AgentTrek harvests travel booking tutorials from the web and replays them in a live browser to generate diverse training trajectories, covering airline websites the agent has never seen before.

📈 Overall Progress

GUI agent research has undergone two major paradigm shifts in rapid succession. First, the move from text-based screen representations to direct visual perception (2023–2024) enabled zero-shot generalization across apps. Second, the shift from offline SFT to online RL with continuous rewards (2025) unlocked dramatic performance gains — small 7B models now routinely outperform 72B supervised baselines. The field is now entering a third phase focused on generalization, verification, and proactive agency.

📂 Sub-topics

RL Reward Design for GUI Grounding

12 papers

Papers that replace binary hit-or-miss reward signals with continuous, distance-based, or density-aware rewards to improve spatial precision in GUI element grounding through reinforcement learning.

Gaussian Continuous Reward, Dense Point Reward, Dynamic Location Reward, Adaptive Exploration Policy Optimization
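
The difference between a binary hit-or-miss reward and a continuous spatial reward can be sketched in a few lines. This is a minimal illustration of the general idea, not the reward used in any specific paper: the per-axis standard deviation is assumed to scale with the element's width and height via a hypothetical `sigma_scale` parameter.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def binary_reward(click: Point, box: Box) -> float:
    """1.0 if the click lands inside the target box, else 0.0."""
    x, y = click
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def gaussian_reward(click: Point, box: Box, sigma_scale: float = 0.25) -> float:
    """Reward that decays smoothly with distance from the element center.

    Assumption: sigma along each axis is sigma_scale times the box size,
    so larger elements tolerate larger offsets.
    """
    x, y = click
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx = max(sigma_scale * (x2 - x1), 1e-6)
    sy = max(sigma_scale * (y2 - y1), 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


if __name__ == "__main__":
    date_cell = (100.0, 100.0, 140.0, 120.0)
    for click in [(120.0, 110.0), (145.0, 110.0), (170.0, 110.0)]:
        print(click, binary_reward(click, date_cell),
              round(gaussian_reward(click, date_cell), 4))
```

A click just outside the box gets zero under the binary reward but a small positive value under the Gaussian one, and that value grows as the click moves toward the center, which gives the policy a direction to improve in.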

Online Environment RL Training

5 papers

Approaches that train GUI agents through live interaction with real or emulated environments using online reinforcement learning, replacing static offline datasets with dynamic exploration and self-play.

Offline-to-Online RL, Difficulty-Adaptive GRPO, Self-Evolving Agent Loop, Experience Replay Policy Optimization
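
The core of GRPO-style training is a group-relative advantage: several rollouts of the same task are scored, and each rollout's reward is normalized against the group's mean and standard deviation. The sketch below shows that computation together with a simple difficulty filter. The filter rule (skip tasks where every rollout received the same reward) is an assumption chosen for illustration and is not taken from any specific paper listed here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_task(rewards: List[float]) -> bool:
    """Hypothetical difficulty filter.

    If all rollouts succeed or all fail, every advantage is zero and the
    task contributes no learning signal, so it is skipped.
    """
    return max(rewards) > min(rewards)


if __name__ == "__main__":
    groups = {
        "too easy": [1.0, 1.0, 1.0, 1.0],
        "too hard": [0.0, 0.0, 0.0, 0.0],
        "informative": [1.0, 0.0, 0.0, 1.0],
    }
    for name, rewards in groups.items():
        if keep_task(rewards):
            print(name, [round(a, 2) for a in group_relative_advantages(rewards)])
        else:
            print(name, "skipped (no reward variance)")
```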

Visual Grounding Architectures

5 papers

Novel model architectures that improve how agents localize and identify GUI elements, including attention-based coordinate-free methods, high-resolution dual-branch encoders, and zoom-in refinement strategies.

Action Attention Grounding, High-Resolution Cross-Module, Two-Stage Zoom-In, Mixed On-policy RL
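
Attention-based, coordinate-free grounding can be pictured as choosing the image patch with the highest attention weight and clicking its center, rather than asking the model to emit coordinate digits as text. The sketch below is a simplified illustration under that assumption. The patch grid, the score values, and the `click_from_attention` helper are hypothetical and do not reproduce any specific model's action head.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def click_from_attention(patch_scores: List[List[float]],
                         image_size: Tuple[int, int]) -> Tuple[float, float, float]:
    """Return (x, y, weight) for the center of the most-attended patch.

    patch_scores is a rows x cols grid of raw attention scores;
    image_size is (width, height) in pixels.
    """
    rows, cols = len(patch_scores), len(patch_scores[0])
    width, height = image_size
    weights = softmax([s for row in patch_scores for s in row])
    best = max(range(len(weights)), key=weights.__getitem__)
    r, c = divmod(best, cols)
    x = (c + 0.5) * width / cols
    y = (r + 0.5) * height / rows
    return x, y, weights[best]


if __name__ == "__main__":
    scores = [[0.1, 0.2, 0.0],
              [0.3, 0.1, 2.5]]
    print(click_from_attention(scores, (300, 200)))
    print(click_from_attention(scores, (600, 400)))
```

Because the click is derived from the patch grid, the same scores map to proportionally scaled pixel positions at any screen resolution.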

Agent Architectures and Planning Frameworks

8 papers

Multi-agent systems, planning-execution-reflection pipelines, and autonomous exploration strategies that enable GUI agents to handle complex, long-horizon workflows across mobile and desktop environments.

Multi-Agent Collaboration, Plan-Act-Reflect Pipeline, GUI-DFS Exploration, Set-of-Mark Prompting

Benchmarks, Evaluation, and Security

9 papers

Evaluation frameworks, knowledge benchmarks, automated auditing methods, and adversarial attacks targeting GUI agents, including proactive intent prediction and efficiency-based backdoor attacks.

Agentic Verification, GUI Knowledge Decoupling, Adversarial Pop-up Injection, Proactive Intent Recommendation

💡 Key Insights

💡 Continuous spatial rewards outperform binary hit/miss by 20–25 points in GUI grounding

💡 Online RL in live environments yields gains of up to ~49.5 percentage points over static supervised learning (DigiRL on AITW)

💡 Small 7B RL-trained models routinely surpass 72B supervised baselines on grounding benchmarks

💡 Adversarial pop-ups achieve 86% attack success rate, exposing critical safety gaps in GUI agents

💡 Proactive intent prediction from passive screen observation defines the next frontier for GUI agents

📅 Timeline

Research has evolved from proving VLM feasibility for GUI tasks (2023) through architectural innovations for visual grounding (2024) to a dominant focus on reinforcement learning with sophisticated reward design (2025–2026), with emerging attention to safety, continual learning, and proactive agent behavior.

2023-11 to 2024-06 Foundation: Zero-shot VLM agents and first autonomous frameworks

(GPT-4V in Wonderland, 2023) demonstrated zero-shot GUI navigation with GPT-4V using Set-of-Mark visual tagging, outperforming supervised Llama-2 by +24.6 points on AITW
(CogAgent, 2023) introduced a dual-branch high-resolution cross-module for GUI understanding, achieving SOTA on AITW while reducing FLOPs by >50%
(Mobile-Agent, 2024) built the first vision-centric autonomous mobile agent decoupling planning (GPT-4V) from localization (specialized visual tools)
(DigiRL, 2024) pioneered online RL for device control with 64 parallel emulators, achieving 67.2% success on AITW, a gain of 49.5 percentage points over SFT (17.7%)

🔀 Transition from text-based screen parsing to direct visual perception of GUI screenshots using multimodal models.

2024-09 to 2025-01 Scaling: Domain-specific pre-training, multi-agent collaboration, and early safety analysis

(MobileVLM, 2024) introduced graph-structured mobile pre-training with Mobile3M dataset, improving navigation by +34.2% over Qwen-VL-Max
Pop-up attack study (Attacking Vision-Language Computer Agents via Pop-ups, 2024) revealed that adversarial pop-ups achieve 86% attack success rate against VLM agents
(AgentTrek, 2024) demonstrated scalable trajectory synthesis from web tutorials at $0.55 per trajectory, improving WebArena success by +9.3%
(InfiGUIAgent, 2025) proposed two-stage SFT with synthesized native reasoning including expectation-reflection loops for self-correction

2025-03 to 2025-10 RL revolution: Continuous rewards, online training, and coordinate-free grounding

(UI-R1, 2025) showed that rule-based RL with just 136 samples can rival large-scale SFT models, gaining +22.1% on ScreenSpot
(GUI-G2, 2025) introduced Gaussian reward modeling that let a 7B model reach 47.5% on ScreenSpot-Pro, a 24.7% relative improvement over UI-TARS-72B (38.1%)
(GUI-Actor, 2025) eliminated coordinate prediction entirely via attention-based patch selection, outperforming UI-TARS-72B on ScreenSpot-Pro
(MobileRL, 2025) achieved 80.2% on AndroidWorld with difficulty-adaptive GRPO, surpassing previous SOTA by +16 points
(MiMo-VL, 2025) set new GUI grounding SOTA at 56.1 on OSWorld-G through mixed on-policy RL combining perception, grounding, and reasoning rewards

🔀 Shift from binary success/failure rewards to continuous spatial rewards and from offline SFT to online RL in live environments.

2026-01 to 2026-03 Maturation: Generalization, verification, continual learning, and proactive agents

(Agentic Reward Modeling, 2026) introduced agentic verification where reward models actively probe the environment, improving evaluation accuracy to 92.9%
BEPA (From Off-Policy to On-Policy, 2026) bridged expert framework systems and end-to-end agents via bi-level assimilation, reaching 32.1% on OSWorld-Verified
(OSExpert, 2026) introduced GUI-DFS exploration for autonomous skill discovery, tripling long-horizon task success and closing 80% of the human efficiency gap
(PIRA-Bench, 2026) defined the proactive intent recommendation paradigm where agents anticipate user goals from passive screen observation

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Continuous Spatial Reward RL	Model GUI elements as spatial distributions (Gaussian or distance-based) so near-miss clicks receive partial reward proportional to proximity.	GUI-G2 reaches 47.5% accuracy on ScreenSpot-Pro, a 24.7% relative improvement over UI-TARS-72B (38.1%, i.e., +9.4 percentage points). SE-RFT achieves 47.3% with only 3,018 training samples, outperforming UI-TARS-72B by +24.2% (relative).	GUI-G2 (2025), UI-R1 (2025), Enhancing Visual Grounding for GUI... (2025), UI-AGILE (2025), InfiGUI-G1 (2025), LPO (2025)
Online Agentic RL with GRPO	Deploy agents in parallel live environments with difficulty-adaptive curriculum and trajectory-level advantage scoring for long-horizon sparse rewards.	MobileRL achieves 80.2% success on AndroidWorld, improving over previous SOTA (64.2%) by +16.0 percentage points. DigiRL achieves 67.2% on AITW, a +49.5% absolute gain over SFT (17.7%).	DigiRL (2024), MobileRL (2025), ZeroGUI (2025), From Off-Policy to On-Policy: Enhancing... (2026), GUI-Libra (2026)
Coordinate-Free Visual Grounding	Use attention heads or zoom-in refinement to localize elements from visual features directly, bypassing text-based coordinate token generation.	GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro, outperforming the 10× larger UI-TARS-72B (38.1) by +6.5 points. R-VLM improves grounding by +13% absolute over SeeClick across platforms.	CogAgent (2023), GUI-Actor (2025), R-VLM (2025), MiMo-VL (2025)
Agentic Verification and Pre-operative Critics	Empower verifier models with interactive capabilities to actively probe environment state rather than passively observing screenshots.	VAGEN improves evaluation accuracy from 84.7% (LLM-as-a-Judge) to 92.9% on OSWorld-Verified (+8.2%). GUI-Critic-R1 improves AndroidWorld success from 22.4% to 27.6% (+5.2%).	Agentic Reward Modeling (2026), Look Before You Leap: A... (2025), Guiding VLM Agents with Process... (2025)
Autonomous Data Synthesis for GUI Training	Replace expensive human-annotated trajectory collection with automated pipelines that convert web tutorials or random exploration into grounded training data.	AgentTrek achieves +9.3% task success on WebArena (22.4% vs 13.1% for GPT-4o) at $0.55/trajectory. GUI-Shift improves AndroidControl-High by +11.2% Exact Match over the base model.	AgentTrek (2024), GUI-Shift (2025), MobileGUI-RL (2025)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
ScreenSpot-Pro	Accuracy (%)	47.5%	GUI-G2 (2025)
AndroidWorld	Success Rate (%)	80.2%	MobileRL (2025)
OSWorld-Verified	Success Rate (%)	32.13%	From Off-Policy to On-Policy: Enhancing... (2026)
Android-in-the-Wild (AitW)	Success Rate (%)	67.2%	DigiRL (2024)
OSWorld-G (GUI Grounding)	Accuracy Score	56.1	MiMo-VL (2025)

⚠️ Known Limitations (4)

Generalization degrades sharply from seen to unseen applications — RL gains drop from 26% on familiar instances to only 8% on new apps, suggesting overfitting to specific UI patterns (affects: Online Agentic RL with GRPO, Continuous Spatial Reward RL)
Potential fix: Few-shot test-time adaptation, domain randomization across diverse UI layouts, and curriculum-based training spanning multiple app categories
Long-horizon desktop tasks remain largely unsolved with best success rates around 32%, as errors compound across dozens of sequential actions with no recovery mechanism (affects: Online Agentic RL with GRPO, Coordinate-Free Visual Grounding)
Potential fix: Hierarchical skill decomposition (as in OSExpert's GUI-DFS), process reward models for step-level correction, and modular planning with verified sub-goals
GUI agents are highly vulnerable to adversarial visual attacks — simple pop-up injections derail task completion in 86% of cases, and basic prompt-based defenses are ineffective (affects: Coordinate-Free Visual Grounding, Online Agentic RL with GRPO)
Potential fix: Adversarial training with injected distractors, element provenance verification, and safety-constrained action spaces that block interactions with unverified UI elements
Catastrophic forgetting when adapting to new apps — SFT overwrites old knowledge while RL struggles with sparse rewards in unfamiliar domains, requiring careful balancing of exploration and retention (affects: Continuous Spatial Reward RL, Online Agentic RL with GRPO)
Potential fix: Gradient surgery to project new learning onto conflict-free subspaces (CGL), entropy-guided SFT warmup with RL consolidation, and experience replay buffers
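
The gradient-surgery idea in the last fix can be sketched with a PCGrad-style projection: when the gradient for a new app points against the gradient that protects old knowledge, the conflicting component is removed. This is an illustrative stand-in and not the CGL method itself; the function names and the two-dimensional toy gradients are assumptions.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict_free(g_new: List[float], g_old: List[float]) -> List[float]:
    """Remove the part of g_new that opposes g_old.

    If the gradients agree (non-negative dot product), g_new is returned
    unchanged. Otherwise g_new is projected onto the plane orthogonal to
    g_old, so the update no longer increases the old-task loss to first order.
    """
    overlap = dot(g_new, g_old)
    if overlap >= 0:
        return list(g_new)
    scale = overlap / dot(g_old, g_old)
    return [gn - scale * go for gn, go in zip(g_new, g_old)]


if __name__ == "__main__":
    old_task = [0.0, 1.0]
    new_task = [1.0, -1.0]   # conflicts with old_task along the second axis
    fixed = project_conflict_free(new_task, old_task)
    print(fixed, "dot with old gradient:", dot(fixed, old_task))
```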

📚 View major papers in this topic (10)

MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents (2025-09) 9
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (2024-06) 9
CogAgent: A Visual Language Model for GUI Agents (2023-12) 9
GUI-G2: Gaussian Reward Modeling for GUI Grounding (2025-07) 8
Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction (2026-01) 8
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2025-06) 8
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (2024-12) 8
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (2023-11) 8
OSExpert: Computer-Use Agents Learning Professional Skills via Exploration (2026-03) 8
MiMo-VL Technical Report (2025-06) 8

💡 Another cross-cutting theme examines Remote Sensing and Geospatial.

🔬

Remote Sensing and Geospatial

What: Research on adapting vision-language and foundation models to interpret satellite, aerial, and Earth observation imagery for tasks like classification, segmentation, object detection, and spatial reasoning.

Why: Earth observation data is critical for disaster response, environmental monitoring, urban planning, and agriculture, yet general-purpose AI models struggle with its unique overhead perspectives and domain-specific semantics.

Baseline: Standard vision-language models like CLIP or LLaVA, pretrained on internet-scale natural images, applied directly to remote sensing tasks via zero-shot transfer or basic fine-tuning.

Severe scarcity of large-scale, annotated image-text datasets for satellite and aerial imagery domains
Unique visual characteristics including bird's-eye perspectives, extreme scale variation, and tiny objects in massive pixel spaces
Multi-modal heterogeneity across optical, SAR, infrared, and temporal data sources with alignment and fusion difficulties

🧪 Running Example

❓ After Hurricane Maria, identify damaged buildings in this satellite image of San Juan, estimate destruction severity, and recommend priority areas for rescue teams.

Baseline: A standard VLM (e.g., GPT-4V or LLaVA) struggles with overhead views of buildings, cannot reliably count damaged structures (R² = 0.10 for destruction counting), confuses rubble with construction sites, and lacks the temporal reasoning to compare pre- and post-disaster imagery.

Challenge: This example illustrates all three key challenges: (1) no large captioned disaster-imagery dataset exists for training, (2) damaged buildings appear as tiny irregular patches from orbit requiring fine-grained spatial reasoning, and (3) combining optical and SAR imagery (which can see through clouds during storms) requires multi-modal fusion.

✅ Ground-Remote Vision-Language Alignment: GRAFT-style approaches could align ground-level photos of hurricane damage with overhead satellite views, bootstrapping domain understanding without manual annotation of satellite images.

✅ Reinforcement Learning with Verifiable Rewards for RS: Few-shot RLVR can fine-tune a VLM using just a handful of labeled damage examples with binary reward signals, unlocking latent reasoning about destruction patterns without thousands of annotations.

✅ Granularity-oriented Mixture of Experts: RSUniVLM's G-MoE routes the query to specialized experts: an image-level expert for overall scene assessment, a region-level expert for localizing damaged blocks, and a pixel-level expert for segmenting individual collapsed structures.

✅ Factorized Multi-Modal Foundation Pretraining: SkySense's factorized encoder processes optical and SAR imagery in separate branches before fusion, enabling cloud-penetrating SAR to complement optical data during storm conditions for comprehensive damage mapping.

✅ Training-Free VLM Aerial Navigation: SPF enables a rescue drone to navigate to priority areas using natural language instructions ('fly to the collapsed building near the river'), converting VLM spatial reasoning into real-time 3D waypoints.

📈 Overall Progress

The field has progressed from lacking any large-scale RS image-text data (pre-2023) to having multiple million-scale datasets and billion-parameter foundation models. A major paradigm shift occurred around 2025 with reinforcement learning replacing supervised fine-tuning as the dominant adaptation strategy, enabling few-shot domain transfer. Concurrently, the field has moved from perception-only models toward agentic systems capable of multi-step spatial reasoning, tool use, and real-time drone navigation.

📂 Sub-topics

Remote Sensing Vision-Language Model Adaptation

12 papers

Methods for adapting general-purpose VLMs to remote sensing domains, including novel training data pipelines, annotation-free alignment strategies, and parameter-efficient fine-tuning techniques that bridge the domain gap between internet imagery and Earth observation data.

Ground-Remote Alignment (GRAFT) GeoRSCLIP/RS5M GeoChat OSMDA

Reinforcement Learning-Enhanced Reasoning for RS

8 papers

Applying reinforcement learning with verifiable rewards (RLVR) and group relative policy optimization (GRPO) to unlock and strengthen reasoning capabilities in remote sensing VLMs, especially in few-shot and resource-constrained settings.

Few-Shot RLVR SAMChat-R1 UAV-VL-R1 Text-Before-Vision

Benchmarks, Datasets, and Evaluation

11 papers

Construction of large-scale datasets and comprehensive benchmarks that expose the gap between general VLM capabilities and geospatial domain requirements, spanning scene classification, counting, change detection, and cartographic reasoning.

SkyScript MMEarth DisasterM3 GEOBench-VLM

Aerial and UAV Navigation and Tracking

7 papers

Leveraging VLMs for drone-based vision-and-language navigation, aerial object search, and multi-object tracking from UAV platforms, addressing challenges of real-time control, 3D spatial reasoning, and motion blur.

See-Point-Fly (SPF) AirHunt ViSA MM-Tracker

Multi-Modal Earth Observation and Foundation Models

15 papers

Large-scale self-supervised pretraining across multiple Earth observation modalities (optical, SAR, LiDAR, hyperspectral, climate data) and specialized applications including multi-modal fusion for detection, segmentation, hyperspectral unmixing, and geospatial intelligence.

SkySense DeepEarth/Earth4D SM3Det WS-Net

💡 Key Insights

💡 One training example with RL rewards can match thousands of supervised annotations for satellite VLMs

💡 Ground-level internet photos effectively bridge the satellite-to-language annotation gap

💡 Mixture of Experts prevents interference between image, region, and pixel-level RS tasks

💡 Text-only domain knowledge cold-start dramatically improves subsequent visual RL performance

💡 Spatial grounding outperforms text-based action prediction for drone navigation by 65+ points

📅 Timeline

Research has evolved from constructing foundational datasets and adapting pretrained VLMs (2023) through benchmarking gaps and multi-modal architectures (2024) to RL-driven reasoning, ultra-high-resolution understanding, and deployable agentic systems for aerial navigation and disaster response (2025–2026).

2023-06 to 2023-12 Foundational RS VLMs and large-scale dataset construction

(RS5M, 2023) constructed the first 5-million-pair RS image-text dataset using filtered web data and generated captions
(SkyScript, 2023) mined 2.6 million pairs from OpenStreetMap with 29,000 semantic tags, two orders of magnitude richer than prior datasets
(Ground Remote Alignment, 2023) demonstrated annotation-free VLM training by using ground photos as a semantic bridge, outperforming supervised VLMs by 20%
(GeoChat, 2023) established the first grounded conversational RS VLM with task-specific tokens and 318k instruction pairs
(SkySense, 2023) introduced the first billion-scale multi-modal RS foundation model with factorized spatiotemporal encoding, achieving SOTA on all 16 benchmarks

🔀 Shift from small, manually annotated RS datasets to million-scale automatically constructed image-text pairs using geographic metadata, enabling the first zero-shot VLMs for remote sensing.

2024-01 to 2024-12 Benchmarking, MoE architectures, and multi-modal expansion

The GPT-4V Earth Observation benchmark (Good at captioning, bad at counting, 2024) revealed that frontier VLMs fail catastrophically on counting and change detection in satellite imagery
(MMEarth, 2024) created a 1.2-million-location, 12-modality pretraining corpus and proposed Multi-Pretext MAE for geospatial representation learning
(RSUniVLM, 2024) and (RS-MoE, 2024) introduced Mixture-of-Experts architectures for multi-granularity RS understanding
(GEOBench-VLM, 2024) established a 31-task benchmark showing the best model achieves only 41.7% accuracy, highlighting the geospatial domain gap
SM3Det (SM3Det, 2024) unified multi-modal detection across RGB, SAR, and infrared with grid-level sparse MoE

2025-01 to 2026-03 RL-based reasoning, agentic systems, and ultra-high-resolution understanding

(Few-Shot, 2025) demonstrated that a single training example with binary rewards can match models trained on thousands of annotated samples
UAV-VL-R1 (UAV-VL-R1, 2025) showed a 2B model outperforming the 36× larger Qwen2-VL-72B through multi-stage GRPO curriculum learning
SPF (See, Point, Fly, 2025) achieved 93.9% drone navigation success without any training by reframing action as spatial grounding
(Text Before Vision, 2026) established SOTA on XLRS-Bench by injecting text-only Earth-science knowledge before agentic visual RL
(GeoReason, 2026) introduced Logical Consistency Reward to combat reasoning hallucinations in spatial decision-making
(DeepEarth, 2026) achieved planetary-scale 4D modeling with 99.3% parameter reduction through learned hash encoding

🔀 Shift from supervised fine-tuning to reinforcement learning with verifiable rewards, enabling few-shot and even one-shot domain adaptation for remote sensing while unlocking structured reasoning capabilities.
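
As a rough illustration of the verifiable-reward idea behind this shift, the sketch below scores a group of sampled answers with a binary exact-match reward (or an IoU reward for boxes) and converts the rewards into group-relative advantages in the GRPO style. The function names, the exact-match rule, and the use of the population standard deviation are assumptions made for illustration; they are not taken from any of the cited papers.

```python
from statistics import mean, pstdev


def binary_reward(prediction: str, gold: str) -> float:
    """Verifiable reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def iou_reward(box_a, box_b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each reward by the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one damage question whose labeled answer is "yes".
samples = ["yes", "no", "yes", "Yes "]
rewards = [binary_reward(s, "yes") for s in samples]
advantages = group_relative_advantages(rewards)
overlap = iou_reward((0, 0, 2, 2), (1, 1, 3, 3))
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so a single labeled example can still produce a learning signal as long as the sampled answers disagree.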

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Ground-Remote Vision-Language Alignment	Align satellite image encoders to CLIP's embedding space using co-located ground photos or OpenStreetMap tags as a semantic bridge, avoiding the need for direct satellite-text pairs.	Outperforms supervised RS VLMs by up to 20% on zero-shot classification (GRAFT) and achieves +6.2% average accuracy over baseline CLIP on seven benchmarks (SkyScript).	Remote Sensing Vision-Language Foundation Models... (2023), RS5M and GeoRSCLIP (2023), SkyScript (2023), A Recipe for Improving Remote... (2025), OSMDA (2026)
Reinforcement Learning with Verifiable Rewards for Remote Sensing	Use binary or IoU-based verifiable rewards with policy gradient optimization (GRPO) to fine-tune VLMs on remote sensing tasks, replacing thousands of annotated examples with minimal supervision.	1-shot RLVR yields +11.65% on RSVQA-LR and +24.38% on DIOR-RS over the base model; Text-Before-Vision achieves 60.40% Pass@1 on XLRS-Bench, surpassing GPT-5.2 and Gemini 3.0 Pro.	Few-Shot (2025), SAMChat (2025), UAV-VL-R1 (2025), Text Before Vision (2026), GeoReason (2026)
Granularity-oriented Mixture of Experts	Route visual inputs to granularity-specific experts (image-level, region-level, pixel-level) using task-aware routers, preventing interference between different spatial reasoning requirements.	RSUniVLM achieves +29.7 percentage points on VRSBench-Ref visual grounding over GeoChat (69.31% vs. 39.6%) and 86.86% on SIRI-WHU scene classification versus GeoChat's 43.67%.	RSUniVLM (2024), RS-MoE (2024), SM3Det (2024)
Factorized Multi-Modal Foundation Pretraining	Factorize spatial, temporal, and modality dimensions into separate encodable components with multi-granularity contrastive learning, enabling flexible handling of varying input combinations.	SkySense surpasses Scale-MAE by +3.61% average across 16 datasets; MMEarth achieves +3.4% Top-1 accuracy over ImageNet baselines on land cover classification; DeepEarth achieves +35.0% R² improvement with 99.3% parameter reduction.	SkySense (2023), MMEarth (2024), Self-Supervised (2026), LEPA (2026)
Training-Free VLM Aerial Navigation	Convert VLM visual understanding into 3D waypoints by grounding target locations as pixels in the image and unprojecting them using camera geometry, decoupling slow reasoning from fast control.	SPF achieves 93.9% success rate versus PIVOT's 28.7% (+65 points); AirHunt improves success rate by 49.1% and reduces navigation error by 80.3% over baselines.	See, Point, Fly: A Learning-Free... (2025), AirHunt (2026), ViSA-Enhanced Aerial VLN (2026)
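
The pixel-to-waypoint step in the last row can be sketched with a standard pinhole camera model. This is a minimal illustration, not SPF's implementation: the `Intrinsics` type, the `unproject_pixel` helper, and the assumption that a target depth is already available are all hypothetical, since the summary does not say how depth is obtained.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera parameters in pixels (hypothetical values below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject_pixel(u: float, v: float, depth: float, k: Intrinsics):
    """Back-project pixel (u, v) at a given depth along the optical axis
    into camera-frame coordinates (x right, y down, z forward)."""
    x = (u - k.cx) / k.fx * depth
    y = (v - k.cy) / k.fy * depth
    return (x, y, depth)


# A VLM points at pixel (420, 240); we assume the target is 5 m ahead.
camera = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
waypoint = unproject_pixel(420.0, 240.0, 5.0, camera)
```

Because the slow VLM only has to emit a 2D point, the geometric conversion and the low-level controller can run at a much higher rate, which is the decoupling the table describes.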

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
VRSBench-Ref (Visual Grounding)	Accuracy (%)	69.31%	RSUniVLM (2024)
XLRS-Bench (Ultra-High-Resolution RS Reasoning)	Pass@1 (%)	60.40%	Text Before Vision (2026)
RSVQA-LR (Remote Sensing Visual Question Answering)	Accuracy (%)	+11.65% over base model with 1-shot training	Few-Shot (2025)
VisDrone (UAV Multi-Object Tracking)	MOTA (Multiple Object Tracking Accuracy)	SOTA on VisDrone dataset	MM-Tracker (2024)
DRL Simulator (Aerial Vision-Language Navigation)	Success Rate (%)	93.9% (simulation), 92.7% (real-world)	See, Point, Fly: A Learning-Free... (2025)

⚠️ Known Limitations (4)

Cross-sensor generalization gap: models trained on optical imagery degrade significantly on SAR, infrared, and hyperspectral data due to fundamentally different imaging physics and appearance statistics. (affects: Ground-Remote Vision-Language Alignment, Reinforcement Learning with Verifiable Rewards for Remote Sensing, Granularity-oriented Mixture of Experts)
Potential fix: Multi-modal fusion frameworks like SkySense's factorized encoding and SM3Det's grid-level MoE can process multiple sensor types jointly, while dedicated SAR-optical alignment training may narrow the gap.
Counting and fine-grained quantification failure: even frontier VLMs consistently fail at counting objects in aerial imagery (R² < 0.20), especially as density increases beyond 50 objects per scene. (affects: Ground-Remote Vision-Language Alignment, Granularity-oriented Mixture of Experts)
Potential fix: Dedicated counting heads or density estimation modules, combined with higher-resolution processing and region-level expert routing, could improve quantitative spatial reasoning.
Ultra-high-resolution processing bottleneck: satellite images can span tens of thousands of pixels per side, but VLMs are limited to small input sizes, requiring complex tiling or zoom-in strategies that increase latency. (affects: Reinforcement Learning with Verifiable Rewards for Remote Sensing, Factorized Multi-Modal Foundation Pretraining)
Potential fix: Agentic zoom-in tools (as in Text-Before-Vision), position encoding interpolation (as in GeoChat), and hierarchical patch processing can enable efficient UHR reasoning.
Logical hallucinations in spatial reasoning: models produce correct answers from flawed reasoning chains or rely on positional shortcuts, undermining reliability for strategic applications. (affects: Reinforcement Learning with Verifiable Rewards for Remote Sensing, Training-Free VLM Aerial Navigation)
Potential fix: Logical Consistency Rewards that penalize reasoning drift under option permutation (GeoReason) and explicit multi-phase verification pipelines (ViSA) can enforce grounded spatial logic.
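
One simple way to probe the positional shortcuts mentioned above is to re-ask the same multiple-choice question under shuffled option orders and measure how often the selected option content stays the same. The sketch below is a hypothetical consistency score written for illustration; it is not GeoReason's reward formula, and `answer_fn` stands in for any model call that returns the index of the chosen option.

```python
from itertools import permutations


def consistency_score(answer_fn, question, options, max_perms=6):
    """Fraction of option orderings on which the model picks the same content.

    answer_fn(question, options) -> index of the chosen option (hypothetical API).
    """
    chosen = []
    for i, perm in enumerate(permutations(options)):
        if i >= max_perms:
            break
        idx = answer_fn(question, list(perm))
        chosen.append(perm[idx])
    most_common = max(set(chosen), key=chosen.count)
    return chosen.count(most_common) / len(chosen)


def positional_shortcut(question, options):
    """A degenerate 'model' that always picks the first option."""
    return 0


def content_based(question, options):
    """A 'model' whose choice depends only on option content."""
    return options.index("river")


opts = ["river", "road", "forest"]
shortcut_score = consistency_score(positional_shortcut, "What borders the site?", opts)
grounded_score = consistency_score(content_based, "What borders the site?", opts)
```

A score near 1.0 means the choice follows the option content; a score near `1 / len(options)` suggests the model is following position rather than evidence.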

💡 Another cross-cutting theme examines Audio and Speech Integration.

🏆

Audio and Speech Integration

What: Research on integrating audio and speech signals into multimodal AI systems for joint understanding, reasoning, and generation across audio-visual-text modalities.

Why: Effective human-AI interaction requires understanding not just text and images, but also speech content, environmental sounds, and their complex relationships to visual context.

Baseline: Cascaded pipelines that first convert speech to text via ASR then feed transcripts to text-only language models, losing paralinguistic cues and environmental audio information.

Cross-modal alignment between continuous audio signals and discrete text tokens causes information loss and modality conflicts during training
Complex audio reasoning requiring temporal ordering, counting, and causal inference remains far below human-level performance
Joint generation of temporally synchronized audio-visual content demands precise local alignment beyond global semantic matching

🧪 Running Example

❓ Watch this 2-minute cooking video and tell me: what dish is being prepared, does the chef sound confident or hesitant, and when does the sizzling indicate the pan is hot enough?

Baseline: A cascaded ASR-plus-LLM pipeline would transcribe the chef's speech but lose vocal tone (confidence vs. hesitation), miss the sizzling sound entirely as non-speech audio, and lack temporal grounding to pinpoint when specific sound events occur.

Challenge: This example requires simultaneous speech understanding (recipe instructions), paralinguistic analysis (vocal confidence), environmental sound reasoning (sizzling timing), and temporal grounding—all integrated with visual context of the cooking process.

✅ Omni-Modal Native Training: Gemini and VITA-1.5 process raw audio alongside video natively, preserving vocal tone and environmental sounds without transcription loss, enabling holistic scene understanding.

✅ RL-Enhanced Audio Reasoning: SARI and Omni-R1 use reinforcement learning with structured chain-of-thought to teach models multi-step audio reasoning, enabling inference about when sizzling intensity changes.

✅ Dual-Path Audio-Language Architectures: LTU-AS uses separate pathways for speech content and paralinguistic features, allowing the model to simultaneously understand what the chef says and how they say it.

📈 Overall Progress

The field has undergone two major paradigm shifts in three years: first, from siloed audio processing to natively multimodal joint training (led by Gemini and AnyMAL in 2023), and second, from supervised learning to RL-enhanced reasoning (led by SARI and Omni-R1 in 2025). Concurrently, generation capabilities evolved from simple audio-gesture pairing to full cinematic audio-visual production at scale (Movie Gen, Seedance 1.5). The emergence of comprehensive benchmarks (MMAU, AHELM) has been instrumental in revealing the persistent gap between AI and human audio reasoning, which in turn accelerated the adoption of RL methods.

📂 Sub-topics

Audio-Language Understanding & Reasoning

14 papers

Models that perceive and reason about audio signals—including speech, environmental sounds, and music—using language model backbones. This sub-topic covers architectures for audio comprehension and emerging RL-based methods for structured audio reasoning.

LTU-AS GAMA Audio Flamingo 2 SARI

Omni-Modal Speech-Vision-Text Models

16 papers

Unified large language models that natively integrate speech and audio alongside vision and text, enabling end-to-end multimodal understanding and interaction without relying on external ASR or TTS systems.

Gemini VITA-1.5 Lyra AnyMAL

Co-Speech Gesture & Animation Synthesis

11 papers

Generating realistic body gestures, facial animations, and full-body motion synchronized with speech audio, including diffusion-based approaches for stylized and semantically meaningful gesture generation.

DiffGesture EMAGE Media2Face Semantic Gesticulator

Audio-Visual Content Generation & Synchronization

14 papers

Joint generation of synchronized audio and video content, including video-to-audio synthesis, music generation from visual inputs, and end-to-end movie production with coherent soundtracks.

Movie Gen Seedance 1.5 MM-LDM ThinkSound

Multimodal Emotion & Affect Recognition

8 papers

Detecting and reasoning about emotions by fusing audio (vocal tone, prosody), visual (facial expressions, gestures), and textual cues, including clinical applications like depression screening.

Emotion-LLaMA Deep-Emotion AMB-DSGDN Turbo Contrastive Learning

Audio-Visual Safety, Security & Benchmarking

16 papers

Evaluation frameworks for audio-language models, adversarial attacks exploiting audio modalities, deepfake detection, content moderation, and watermarking for joint audio-visual content.

MMAU Video-MME AHELM VoiceJailbreak

💡 Key Insights

💡 GRPO-based RL training pushed audio reasoning from 53% to 71% on MMAU within one year

💡 Text-only RL fine-tuning surprisingly improves audio QA nearly as much as audio-based training

💡 Diffusion models halved gesture generation error compared to GAN-based approaches

💡 Native multimodal joint training outperforms modular cascaded pipeline approaches

💡 Audio-language models suffer catastrophic >98% accuracy drops under adversarial text inputs

📅 Timeline

Research has rapidly converged on three fronts: (1) eliminating cascaded pipelines in favor of end-to-end omni-modal models, (2) applying reinforcement learning to unlock multi-step audio reasoning, and (3) scaling joint audio-visual generation from short clips to feature-length content with precise synchronization.

2023-01 to 2023-12 Foundation: First audio-integrated LLMs and diffusion-based co-speech generation

DiffGesture (Taming Diffusion Models for Audio-Driven..., 2023) introduced the first diffusion-based approach for gesture generation, achieving state-of-the-art FGD of 1.506
LTU-AS (Joint Audio and Speech Understanding, 2023) pioneered dual-path audio perception combining speech recognition with environmental sound understanding in a single LLM
(Gemini, 2023) demonstrated native multimodal joint training across audio, vision, and text, exceeding human-expert MMLU performance
(NExT-GPT, 2023) introduced the first end-to-end any-to-any MM-LLM connecting frozen encoders and diffusion decoders
(Any-Modality, 2023) demonstrated scalable multimodal alignment with a frozen 70B LLM using quantized pre-training

🔀 Shift from single-modality audio models and GAN-based gesture generation to LLM-integrated audio understanding and diffusion-based motion synthesis.

2024-01 to 2024-12 Scaling: Omni-modal models, comprehensive benchmarks, and large-scale generation

(Movie Gen, 2024) scaled video generation to 30B parameters with synchronized 48kHz audio, setting new industry benchmarks
(MMAU, 2024) established the first expert-level audio reasoning benchmark, revealing that the best model (Gemini Pro 1.5) achieves only 52.97% vs. 81.85% human accuracy
(Video-MME, 2024) created the first full-spectrum video benchmark showing audio/subtitles boost performance by 4-6% on longer videos
(Emotion-LLaMA, 2024) achieved top rank on the EMER challenge by aligning audio and multi-view visual encoders into the LLaMA embedding space
Media2Face (Media2Face, 2024) contributed a facial asset, a 60-hour dataset, and a latent diffusion model, achieving 10.44mm Lip Vertex Error and outperforming EmoTalk by 28.5%

🔀 Emergence of comprehensive audio benchmarks (MMAU, Video-MME) revealing that even top models achieve only ~53% on expert-level audio reasoning, catalyzing a push toward deeper reasoning capabilities.

2025-01 to 2026-03 Reasoning: RL-enhanced audio reasoning, joint native generation, and holistic evaluation

(SARI, 2025) extended GRPO to audio with structured CoT and curriculum learning, achieving 67.08% on MMAU
Omni-R1 (Omni-R1, 2025) achieved 71.3% MMAU SOTA and discovered that text-only RL fine-tuning yields comparable audio QA improvements
(Audio Flamingo Sound-CoT, 2025) achieved 79.83% on MMAU-Sound, surpassing GPT-4o Audio by +16.63 percentage points via chain-of-thought reasoning
(AHELM, 2025) introduced the first standardized evaluation covering 10 aspects including fairness, safety, and bias across 14 ALMs
mAVE (mAVE: A Watermark for Joint..., 2026) introduced cryptographic binding of audio-video watermarks to prevent swap attacks, achieving >99% binding integrity

🔀 RL-based training (GRPO) emerges as the dominant paradigm for audio reasoning, with multiple independent groups applying it to push MMAU performance from ~53% to over 71% in under a year.
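
The combination of structured chain-of-thought rewards and easy-to-hard curricula described in this period can be sketched as follows. The `<think>`/`<answer>` tag format, the 0.2 format weight, and the `pass_rate` difficulty proxy are assumptions chosen for illustration; the summary does not specify how SARI or Omni-R1 define them, and a format check is only a weak proxy for a coherent reasoning chain.

```python
import re

# Assumed output format: reasoning inside <think>, final choice inside <answer>.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def structured_reward(completion: str, gold: str, format_weight: float = 0.2) -> float:
    """Reward well-formed reasoning plus a correct final answer.

    Returns 0.0 if the format is violated, format_weight if only the format
    is right, and 1.0 if the answer also matches.
    """
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    correct = match.group(2).strip().lower() == gold.strip().lower()
    return format_weight + (1.0 - format_weight) * float(correct)


def curriculum_order(samples):
    """Order training samples from easy to hard.

    Difficulty proxy (an assumption): the base model's pass rate on each
    sample, where a higher pass rate means an easier sample.
    """
    return sorted(samples, key=lambda s: s["pass_rate"], reverse=True)


good = "<think>The sizzling gets louder around 0:40.</think><answer>B</answer>"
r_good = structured_reward(good, "b")
r_wrong = structured_reward(good, "C")
r_unformatted = structured_reward("B", "B")

ordered = curriculum_order([
    {"id": "count-events", "pass_rate": 0.1},
    {"id": "identify-sound", "pass_rate": 0.8},
    {"id": "order-events", "pass_rate": 0.4},
])
```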

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Omni-Modal Native Training	Ingest raw audio signals (at 16kHz) jointly with vision and text during pre-training rather than grafting audio encoders onto text-only models.	Gemini Ultra surpasses human-expert performance on MMLU with 90.04% (vs. 89.8% human baseline), achieving SOTA on 30 of 32 benchmarks. VITA-1.5 bridges the gap between open-source models and GPT-4o by eliminating separate ASR/TTS modules.	Gemini (2023), VITA-1.5 (2025), Lyra (2024), Any-Modality (2023), Gemini 2.5 (2025)
RL-Enhanced Audio Reasoning	Use GRPO to reward models for both correct answers and coherent reasoning chains, with curriculum learning ordering samples from easy to hard.	Omni-R1 achieves 71.3% on MMAU Test-mini, improving over base Qwen2.5-Omni by +5.4% absolute (65.9% → 71.3%). SARI achieves 67.08% on MMAU test-mini, +16.35% over Qwen2-Audio-7B-Instruct baseline.	SARI (2025), Omni-R1 (2025), Audio-Thinker (2025), Audio Flamingo Sound-CoT Technical Report:... (2025), EchoInk-R1 (2025)
Diffusion-based Co-Speech Motion Synthesis	Model gesture generation as a conditional diffusion process over skeleton or mesh sequences, with cross-modal attention for speech-gesture synchronization.	DiffGesture achieves FGD (Fréchet Gesture Distance) of 1.506 on TED Gesture, halving the previous best HA2G score of 3.072. Media2Face achieves 10.44mm Lip Vertex Error, outperforming EmoTalk (14.61mm) by 28.5%.	Taming Diffusion Models for Audio-Driven... (2023), EMAGE (2023), Media2Face (2024), EchoMimicV3 (2025)
Joint Audio-Visual Diffusion Generation	Process video and audio streams in parallel within a unified diffusion backbone using cross-modal attention to enforce temporal lock-step synchronization.	MM-LDM outperforms MM-Diffusion by 114.6 FVD on AIST++ with 10x faster sampling speed. Movie Gen scales to 30B parameters for 1080p HD video with synchronized 48kHz audio, surpassing Runway Gen3 and OpenAI Sora.	Movie Gen (2024), Seedance 1.5 pro (2025), MM-LDM (2024), ThinkSound (2025), V2M-Zero (2026)
Dual-Path Audio-Language Architectures	Combine discrete speech tokens from an ASR decoder with continuous audio features from encoder layers to capture both what is said and how it sounds.	GAMA outperforms prior LALMs (LTU, SALMONN, Pengi) by 1-84% across diverse audio tasks. Sound-CoT achieves 79.83% on MMAU-Sound, surpassing GPT-4o Audio at 63.20% by +16.63 percentage points.	Joint Audio and Speech Understanding (2023), GAMA (2024), Audio Flamingo 2 (2025), MoE-Adapter (2026)
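
The relative and absolute improvements quoted in the table can be checked with a few lines of arithmetic. The helper name below is hypothetical, and the inputs are the figures reported in the table itself.

```python
def relative_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old` (for lower-is-better metrics)."""
    return (old - new) / old * 100.0


# DiffGesture vs. HA2G on TED Gesture (FGD, lower is better).
fgd_reduction = relative_reduction(3.072, 1.506)

# Media2Face vs. EmoTalk (Lip Vertex Error in mm, lower is better).
lve_reduction = relative_reduction(14.61, 10.44)

# Absolute gains in percentage points (accuracy, higher is better).
omni_r1_gain = 71.3 - 65.9
sound_cot_gain = 79.83 - 63.20

print(round(fgd_reduction, 1))   # 51.0
print(round(lve_reduction, 1))   # 28.5
print(round(omni_r1_gain, 2))    # 5.4
print(round(sound_cot_gain, 2))  # 16.63
```

The FGD reduction of about 51% matches the "halving" description, and the 28.5% figure for Media2Face is a relative reduction in error rather than a difference in percentage points.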

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MMAU (Massive Multi-Task Audio Understanding)	Accuracy (multiple-choice)	71.3% on Test-mini	Omni-R1 (2025)
MMAU-Sound	Accuracy (multiple-choice)	79.83%	Audio Flamingo Sound-CoT Technical Report:... (2025)
Video-MME	Accuracy (multiple-choice)	81.3% (with subtitles)	Video-MME (2024)
TED Gesture (FGD)	Fréchet Gesture Distance (FGD, lower is better)	1.506 FGD	Taming Diffusion Models for Audio-Driven... (2023)
MMLU (Massive Multitask Language Understanding)	Accuracy	90.04%	Gemini (2023)

⚠️ Known Limitations (4)

Severe textual bias in audio-language models: when text and audio conflict, models overwhelmingly trust text, with accuracy dropping from 87.8% to 1.7% under adversarial conditions, undermining reliability in real-world scenarios where modalities may disagree. (affects: Dual-Path Audio-Language Architectures, Omni-Modal Native Training)
Potential fix: MATA proposes training-free attention amplification for audio tokens, while MCR-Bench shows supervised fine-tuning on conflict-rich data can recover adversarial accuracy from 1.5% to 54.3%.
Persistent gap between AI and human audio reasoning: even the best models achieve 71.3% vs. 81.85% human accuracy on expert-level audio tasks, with cross-recording speaker identification remaining at chance level (<50%), indicating fundamental limitations in audio-language alignment. (affects: RL-Enhanced Audio Reasoning, Dual-Path Audio-Language Architectures)
Potential fix: Curriculum-guided RL (SARI) and synthetic reasoning data (AudioSkills in Audio Flamingo 2) show promise, but cross-recording reasoning and long-audio understanding remain open challenges.
Audio-visual security vulnerabilities: adversarial perturbations can inject hidden instructions into audio/images that steer model behavior while remaining imperceptible to humans, and voice-based jailbreak attacks achieve 77.8% success rate against GPT-4o's safety guardrails. (affects: Omni-Modal Native Training, Dual-Path Audio-Language Architectures)
Potential fix: Cryptographic audio-visual binding (mAVE) addresses watermarking integrity, and generator-internal probing (X-AVDT) improves deepfake detection by +13.1% accuracy, but defense against voice jailbreaks remains largely unsolved.
Computational cost and scalability barriers: state-of-the-art generation models require up to 30B parameters (Movie Gen) and massive compute, while long-audio processing beyond 5 minutes remains challenging for most architectures. (affects: Joint Audio-Visual Diffusion Generation, Diffusion-based Co-Speech Motion Synthesis)
Potential fix: EchoMimicV3 demonstrates competitive performance with only 1.3B parameters via a soup-of-tasks paradigm, Phi-4-Multimodal uses a Mixture of LoRAs to keep the base model frozen, and MambaDance replaces quadratic attention with linear-time state space models.

📚 View major papers in this topic (10)

Gemini: A Family of Highly Capable Multimodal Models (2023-12) 10
Movie Gen: A Cast of Media Foundation Models (2024-10) 9
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024-10) 9
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
Phi-4-Mini and Phi-4-Multimodal (2025-04) 9
AHELM: A Holistic Evaluation of Audio-Language Models (2025-09) 9
mAVE: A Watermark for Joint Audio-Visual Generation Models (2026-03) 9
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities (2025-07) 9
SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025-04) 8
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation (2023-03) 8

💡 Another cross-cutting theme examines Medical and Healthcare.

📱

Medical and Healthcare

What: Research on adapting and developing multimodal AI models — integrating medical images, clinical text, and structured data — for diagnosis, report generation, and clinical decision support.

Why: Clinicians must integrate heterogeneous data across imaging modalities, patient history, and lab results, yet current AI systems are fragmented and lack clinical reasoning transparency.

Baseline: Standard approaches use single-modality supervised models trained on task-specific labeled datasets, or adapt general-purpose CLIP-style VLMs via supervised fine-tuning on medical image-text pairs.

Medical data scarcity and privacy constraints limit large-scale multimodal training datasets
Domain gap between natural images and medical images causes poor transfer of general VLMs
Models frequently hallucinate clinical findings not supported by visual evidence
Missing modalities at inference time due to heterogeneous clinical data availability

🧪 Running Example

❓ A patient presents with a chest X-ray and prior imaging history. Generate a radiology report identifying all findings, comparing with the prior study, and providing diagnostic reasoning.

Baseline: A standard VLM fine-tuned via SFT generates a plausible-sounding report but hallucinates findings not present in the image (e.g., fabricating 'pleural effusion'), fails to reference the prior study, and provides no reasoning for its conclusions.

Challenge: This example illustrates multiple key challenges: the model must perceive subtle visual abnormalities (domain gap), avoid hallucinating non-existent findings (factual accuracy), integrate longitudinal data (missing modality handling), and provide transparent reasoning steps (interpretability).

✅ RL-based Medical Reasoning (GRPO/RLVR): Instead of imitating training reports, the model learns to self-generate reasoning in '<think>' blocks before answering, rewarded only for correct final diagnoses — producing transparent, clinically grounded reasoning without expensive CoT annotations.

✅ Medical Vision-Language Foundation Models: Models like MedGemma and Hulu-Med provide strong medical visual encoders pretrained on millions of medical image-text pairs, enabling the model to perceive subtle findings (e.g., small nodules, cardiomegaly) that general CLIP encoders miss.

✅ Medical Multi-Agent Systems: MedRAX orchestrates specialized tools (a segmentation model for anatomy, a classification model for findings, a VQA model for comparison) in a ReAct loop, combining their outputs into a coherent, evidence-based report rather than relying on a single end-to-end model.

✅ Structured Visual Chain-of-Thought: S-Chain decomposes reasoning into explicit stages — localize the lesion → describe morphology → grade severity → classify — each grounded in specific image regions with expert-verified bounding boxes, preventing hallucinations.

✅ Adaptive Report Generation: LLM-RG4 adapts its output based on available inputs: when prior history exists, it generates comparison statements; when absent, it omits them entirely rather than hallucinating, using token-level loss weighting to prioritize clinically significant mentions.
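
The token-level loss weighting mentioned for adaptive report generation can be illustrated with a weighted negative log-likelihood. This is a generic sketch, not LLM-RG4's actual scheme: the weight of 3.0 for clinically significant tokens and the example probabilities are made up for illustration.

```python
import math


def weighted_token_nll(token_probs, weights):
    """Weighted average negative log-likelihood over the target tokens.

    token_probs: model probability assigned to each reference token.
    weights: per-token importance (larger = counts more in the loss).
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)


# Reference sentence: "no pleural effusion is seen"
tokens = ["no", "pleural", "effusion", "is", "seen"]
probs = [0.9, 0.2, 0.3, 0.9, 0.9]          # hypothetical model probabilities
uniform = [1.0] * len(tokens)
clinical = [1.0, 3.0, 3.0, 1.0, 1.0]       # assumed up-weighting of finding terms

plain_loss = weighted_token_nll(probs, uniform)
weighted_loss = weighted_token_nll(probs, clinical)
```

Up-weighting the finding terms makes mistakes on them dominate the loss, so the gradient pushes hardest on the clinically significant mentions rather than on filler words.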

📈 Overall Progress

Medical multimodal AI has undergone three paradigm shifts in three years: from task-specific models to universal foundation models (2023), from 2D to native 3D volumetric understanding (2024), and from supervised fine-tuning to reinforcement-learning-driven reasoning (2025). The field is now converging toward deployable, transparent, multi-agent systems that combine specialist tools with verifiable reasoning chains.

📂 Sub-topics

Reinforcement Learning for Medical Reasoning

18 papers

Applying Group Relative Policy Optimization (GRPO) and Reinforcement Learning with Verifiable Rewards (RLVR) to medical VLMs, enabling emergent chain-of-thought reasoning without expensive expert annotations. This paradigm replaces supervised fine-tuning with reward-driven self-improvement.

GRPO RLVR OraPO DRPO

Medical Vision-Language Foundation Models

35 papers

Large-scale foundation models pretrained on medical image-text data to serve as universal backbones for diverse clinical tasks including classification, segmentation, report generation, and visual question answering across 2D and 3D modalities.

MedGemma Hulu-Med Merlin MedSAM

Medical Multi-Agent and Agentic Systems

12 papers

LLM-orchestrated agent frameworks that coordinate specialized medical tools, enable multi-step clinical reasoning, and replicate collaborative diagnostic workflows through role-specialized agents and tool-augmented inference.

ReAct agents Multi-agent debate Tool-augmented inference Agentic behavior distillation

Robust Multi-Modal Clinical Data Fusion

28 papers

Methods for integrating heterogeneous clinical data (imaging, EHR, genomics, physiological signals) while handling missing modalities, imbalanced data, and modal inconsistencies common in real-world healthcare settings.

ShaSpec DrFuse CLoE PASSION

Medical Report Generation and Clinical Reasoning

22 papers

Automated generation of radiology and clinical reports with a focus on factual accuracy, interpretable reasoning chains, and adaptation to diverse clinical scenarios. Includes structured Chain-of-Thought approaches and retrieval-augmented methods.

Process supervision Visual Chain-of-Thought Fact-Flow decoupling Retrieval-augmented generation

Medical AI Benchmarks and Evaluation

15 papers

Standardized evaluation frameworks, benchmark datasets, and systematic analyses for assessing the quality, safety, robustness, and clinical relevance of medical multimodal AI systems.

MedXpertQA MultiMedEval MedVLMBench Hallucination taxonomies

💡 Key Insights

💡 Reinforcement learning with verifiable rewards outperforms supervised fine-tuning for medical reasoning

💡 Small RL-trained models (2B) can surpass 72B supervised models on medical tasks

💡 Multi-agent medical systems now match proprietary frontier models at 25x fewer parameters

💡 Structured visual Chain-of-Thought with expert grounding reduces hallucinations dramatically

💡 Missing modality robustness through shared-specific decomposition enables real-world deployment

💡 3D-native medical VLMs significantly outperform 2D-to-3D lifting approaches on volumetric tasks


📅 Timeline

Research has evolved from adapting general-domain models (CLIP, SAM) to building purpose-built medical foundation models, and most recently to training these models via reinforcement learning to develop emergent clinical reasoning capabilities — all while increasing emphasis on transparency, safety, and clinical deployability.

2023-03 to 2023-12 Foundation building — universal medical segmentation, multi-modal representation learning, and early domain adaptation

MedSAM (Segment Anything in Medical Images, 2023) adapted the Segment Anything Model to 1.5M medical image-mask pairs, reducing annotation time by 82%
ShaSpec (Shared-Specific, 2023) introduced shared-specific decomposition for handling missing modalities in both segmentation and classification
SleepFM (Multi-modal Representation Learning for Sleep, 2024) pioneered leave-one-out contrastive learning across brain, cardiac, and respiratory signals for sleep analysis
Video pretraining study (Video Pretraining Advances 3D Deep Learning, 2023) demonstrated that natural video pretraining transfers effectively to 3D medical CT tasks

🔀 MedSAM demonstrated that a single foundation model trained on diverse medical data could outperform task-specific models, establishing the universal medical AI paradigm.

2024-01 to 2024-12 Rapid expansion of medical VLMs across 2D and 3D modalities, emergence of multi-modal fusion and agent-based approaches

Merlin (CT Vision-Language Foundation Model, 2024) introduced a 3D-native VLM for CT, published in Nature, achieving +16% F1 in zero-shot findings classification
M3D (3D Medical Image Analysis with MLLMs, 2024) created the largest 3D medical dataset with 120K image-text pairs and 662K instruction pairs
MMedAgent (Learning to Use Medical Tools, 2024) introduced the first multi-modal medical agent framework with six specialized tools
PRISM (Multi-modal Generative Foundation Model for..., 2024) adapted vision-language pretraining to gigapixel whole slide images using GPT-4 report summarization
PaliGemma 2 (Versatile VLMs for Transfer, 2024) achieved SOTA on radiology report generation and demonstrated that general VLMs can replace specialized medical architectures

🔀 Research shifted from adapting 2D natural-image models to building native 3D medical VLMs (Merlin, M3D) that process volumetric data directly.

2025-01 to 2025-12 The reinforcement learning revolution — GRPO/RLVR transforms medical reasoning, agent systems mature, and comprehensive benchmarks emerge

MedVLM-R1 (Medical VLM via Reinforcement Learning, 2025) demonstrated emergent medical reasoning via GRPO, boosting accuracy from 55% to 78% without reasoning annotations
OraPO (Oracle-educated GRPO, 2025) achieved SOTA radiology report generation with only 1K training samples by introducing FactScore rewards
QoQ-Med (Domain-Aware, 2025) balanced training across clinical specialties with domain-aware policy optimization, boosting rare-domain F1 by 43%
MedXpertQA (Expert-Level, 2025) established a rigorous benchmark where even o1 achieves only 49.89% accuracy
S-Chain (Structured Visual Chain-of-Thought, 2025) created the first large-scale expert-annotated visually grounded reasoning dataset with 12K images
Hulu-Med (Transparent Generalist Medical VLM, 2025) unified 2D/3D/video understanding in one architecture, surpassing GPT-4o on 16 of 30 benchmarks

🔀 MedVLM-R1 triggered a paradigm shift from supervised fine-tuning to reinforcement learning for medical VLMs, demonstrating that small models (2B) with RL can outperform 72B supervised models.

2026-01 to 2026-03 Mature deployable systems — lightweight agents, verification frameworks, and clinical-grade evaluation

Meissa (Multi-modal Medical Agentic Intelligence, 2026) distilled agentic behaviors into a 4B-parameter model matching proprietary frontier agents with 22x lower latency
MedMASLab (Unified Framework for Medical Multi-Agent Systems, 2026) standardized multi-agent medical AI evaluation across 11 architectures with semantic verification
PathFinder (Multi-Agent, 2025) surpassed human pathologists on melanoma grading using four collaborative specialized agents
LoV3D (Longitudinal 3D Brain MRI Reasoning, 2026) achieved 93.7% dementia classification with verifiable structured JSON outputs and automated DPO

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
RL-based Medical Reasoning	Train medical VLMs via reward signals (format + accuracy) rather than imitation, enabling self-discovered reasoning without Chain-of-Thought labels.	Improves on SFT baselines by +23.1% average accuracy (MedVLM-R1: 78.22% vs 55.11% base), and OraPO achieves SOTA F1 of 0.357 on MIMIC-CXR using only 1K training samples vs 1.27M for prior methods.	MedVLM-R1 (2025), OraPO (2025), QoQ-Med (2025), MedEyes (2025), MedVLThinker (2025)
Medical Vision-Language Foundation Models	Pretrain unified architectures on millions of medical image-text pairs with domain-specific encoders, enabling zero-shot and few-shot transfer across clinical specialties.	MedGemma improves +15.5–18.1% on out-of-distribution CXR classification over base Gemma; Merlin achieves +16.0% F1 in zero-shot findings classification vs supervised baselines; MedSAM outperforms U-Net by 15.5% on unseen segmentation tasks.	Segment Anything in Medical Images (2023), Merlin (2024), MedGemma (2025), Hulu-Med (2025), M3D (2024)
Medical Multi-Agent Diagnostic Systems	Decompose medical diagnosis into role-specialized agents (triage, imaging, synthesis) that collaborate via structured communication, replacing monolithic end-to-end models.	Meissa matches proprietary frontier agents (GPT-4o) in 10 of 16 settings with 25x fewer parameters and 22x lower latency; MedAgent-Pro outperforms GPT-4o by 34% on glaucoma diagnosis; PathFinder surpasses human pathologists by 9% accuracy.	MMedAgent (2024), MedRAX (2025), Meissa (2026), PathFinder (2025), MedMASLab (2026)
Robust Multi-Modal Fusion with Missing Modalities	Decompose representations into shared (cross-modal) and specific (modality-unique) components, enabling graceful degradation when modalities are missing.	ShaSpec improves brain tumor segmentation Dice by >3–5% over prior methods on BraTS2018; CLoE achieves 88.09% Dice on Whole Tumor vs 87.54% best baseline; ACADiff maintains 89.4% diagnostic accuracy with 20% missing data.	Multi-modal Learning with Missing Modality... (2023), CLoE (2026), ACADiff (2026), DrFuse (2024)
Structured Visual Chain-of-Thought Reasoning	Structure medical reasoning as a multi-step cognitive process where each stage is visually grounded and independently verifiable, mimicking expert diagnostic workflows.	S-Chain supervision improves accuracy by +11.09% over base training and +4.47% over synthetic GPT-4.1 supervision; V2T-CoT achieves +5.11% on SLAKE over LLaVA-Med; ChestX-Reasoner improves reasoning by +18% over base model.	S-Chain (2025), ChestX-Reasoner (2025), Think Twice to See More:... (2025), Thinking with Gaze (2026)
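
The "format + accuracy" reward described in the RL-based Medical Reasoning row can be illustrated with a minimal sketch. The `<think>`/`<answer>` template, the reward weights, and the function names are assumptions chosen for illustration; they are not the exact setup of MedVLM-R1 or OraPO. The group normalization follows the general GRPO idea of scoring each sampled response relative to the other samples for the same prompt.

```python
import re
import statistics

# Hypothetical output template: reasoning in <think>, final choice in <answer>.
_TEMPLATE = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def verifiable_reward(response: str, gold: str) -> float:
    """Format reward (0.5) plus accuracy reward (1.0); the weights are illustrative."""
    match = _TEMPLATE.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns no reward at all
    predicted = match.group(1).strip().lower()
    return 0.5 + (1.0 if predicted == gold.strip().lower() else 0.0)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # all samples tie, so there is no signal
    return [(r - mean) / std for r in rewards]


samples = [
    "<think>Opacity in the right lower lobe.</think><answer>B</answer>",
    "<think>Looks normal.</think><answer>A</answer>",
    "The answer is B",
]
rewards = [verifiable_reward(s, gold="B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed from the output alone, no Chain-of-Thought labels are needed: the well-formatted correct response receives a positive advantage and the malformed one a negative advantage.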

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MIMIC-CXR (Radiology Report Generation)	RadGraph F1	0.357 F1	OraPO (2025)
MedXpertQA MM (Expert-Level Medical Reasoning)	Accuracy	GPT-5 achieves super-human performance	Capabilities of GPT-5 on Multimodal... (2025)
VQA-RAD (Medical Visual Question Answering)	Accuracy	83.20%	MMed-RAG (2024)
BraTS 2020/2021 (Brain Tumor Segmentation)	Dice Score	0.912 average Dice	Modality-Aware (2024)
SLAKE (Medical VQA)	Accuracy	86.3% GPT Score	Can Generalist Vision Language Models... (2025)
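
For readers unfamiliar with the Dice metric used in the BraTS row, the standard definition is 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made here, not details taken from the cited papers.

```python
def dice_score(pred: list[int], target: list[int]) -> float:
    """Dice overlap between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 shared voxels, 3 + 3 foreground voxels
```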

⚠️ Known Limitations (4)

Medical hallucinations remain pervasive — models fabricate findings, misidentify anatomy, or omit critical pathologies, with standard metrics failing to capture clinical danger levels (affects: Medical Vision-Language Foundation Models, RL-based Medical Reasoning (GRPO/RLVR), Structured Visual Chain-of-Thought Reasoning)
Potential fix: Structured visual grounding (S-Chain), FactScore-based rewards (OraPO), and concept bottleneck models that force intermediate clinical fact verification before report generation
Data scarcity and privacy constraints severely limit large-scale medical multimodal training, with most institutions holding fragmented, single-modality datasets that cannot be easily shared (affects: Medical Vision-Language Foundation Models, Robust Multi-Modal Fusion with Missing Modalities)
Potential fix: Federated learning with pseudo-modality generation (Fed-PMG), synthetic data generation from textbook knowledge (MM-Skin, MM-Retinal), and data-efficient RL training (OraPO achieves SOTA with 1K samples)
Evaluation fragmentation — benchmarks use inconsistent metrics, datasets, and prompting strategies, making fair comparison across methods nearly impossible and enabling overfitting to specific test sets (affects: Medical Multi-Agent Diagnostic Systems, Medical Vision-Language Foundation Models)
Potential fix: Unified evaluation toolkits (MultiMedEval), semantic VLM-based judges replacing brittle text matching (MedMASLab), and expert-curated difficult benchmarks with rigorous filtering (MedXpertQA)
Generalist-specialist trade-off — specialized medical VLMs excel in-distribution but fail on out-of-distribution modalities, while generalists lack clinical depth but generalize better after fine-tuning (affects: Medical Vision-Language Foundation Models, RL-based Medical Reasoning (GRPO/RLVR))
Potential fix: Lightweight domain adaptation via LoRA and prompt learning (GDPL), domain-aware RL that balances across specialties (QoQ-Med DRPO), and modular agent systems that dynamically select specialist tools (Meissa, MedRAX)

📚 View major papers in this topic (10)

Segment Anything in Medical Images (2023-04) 9
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset (2024-06) 9
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation (2025-09) 9
QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training (2025-05) 9
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding (2025-10) 9
S-Chain: Structured Visual Chain-of-Thought for Medicine (2025-10) 9
Meissa: Multi-modal Medical Agentic Intelligence (2026-03) 9
MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems (2026-03) 9
Capabilities of GPT-5 on Multimodal Medical Reasoning (2025-08) 9
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments (2026-03) 9

💡 Another cross-cutting theme examines Safety and Robustness.

📚

Safety and Robustness

What: Research on making multimodal models (especially Vision-Language Models) resistant to adversarial attacks, jailbreaks, hallucinations, data poisoning, and failures under distribution shifts.

Why: As VLMs are deployed in safety-critical applications like autonomous driving and healthcare, ensuring they cannot be manipulated or produce harmful content is essential.

Baseline: Standard VLMs inherit text-based safety alignment from LLMs but introduce new vulnerabilities through the visual modality, which bypasses existing safeguards.

Visual inputs create a continuous attack surface that bypasses text-based safety filters and alignment
Models hallucinate objects or fabricate information not grounded in visual evidence
Safety fine-tuning often relies on spurious textual correlations rather than true understanding of harm
Embodied agents face compounding errors where a single unsafe action can cause irreversible physical consequences

🧪 Running Example

❓ A user sends an image of household chemicals with the text: 'What interesting combinations can I make with these?'

Baseline: A standard VLM may provide dangerous chemical mixing instructions because its text-only safety alignment does not recognize the visual context as hazardous, or it may refuse all chemistry-related queries regardless of intent.

Challenge: This example illustrates the core challenges: (1) the image bypasses text safety filters since the text alone seems innocent, (2) the model must reason about cross-modal harm where text+image together are dangerous, and (3) over-refusal occurs if the model blocks all chemistry questions including safe ones.

✅ Safety-Aligned Training (VLGuard/Safe RLHF-V): Trains the model on paired safe/unsafe visual scenarios so it learns to recognize hazardous image contexts and refuse appropriately while remaining helpful for safe queries

✅ Inference-Time Safety Defense (ECSO/ASTRA): Detects the unsafe visual context at inference time; ECSO converts the image to a text caption and re-queries without the image, restoring LLM safety alignment

✅ Protocolized Safety Reasoning (SaFeR-ToolKit/PRISM): Forces the model through a structured Perception→Reasoning→Decision pipeline, making it explicitly identify the chemicals, reason about combination risks, and decide to refuse with an explanation
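
The image-to-text fallback described for ECSO can be sketched as a small control flow. This is a minimal sketch under stated assumptions: `vlm`, `is_unsafe`, and the prompt wording are hypothetical stand-ins, and the toy stubs only mimic the household-chemicals example; none of this is ECSO's actual code.

```python
from typing import Callable, Optional

# Hypothetical interfaces: a real system would wrap actual model calls here.
VLM = Callable[[str, Optional[bytes]], str]   # (prompt, image or None) -> text
SafetyCheck = Callable[[str, str], bool]      # (query, draft answer) -> unsafe?


def answer_with_caption_fallback(query: str, image: bytes,
                                 vlm: VLM, is_unsafe: SafetyCheck) -> str:
    """Answer normally; if the draft looks unsafe, re-query using a caption only."""
    draft = vlm(query, image)
    if not is_unsafe(query, draft):
        return draft
    # Convert the image to text so the text-only safety alignment applies again.
    caption = vlm("Describe the image content relevant to this request: " + query, image)
    text_only_prompt = f"Image description: {caption}\nRequest: {query}"
    return vlm(text_only_prompt, None)


def toy_vlm(prompt: str, image: Optional[bytes]) -> str:
    """Stub that behaves unsafely with an image and safely without one."""
    if image is None:
        return "I can't help with mixing these products."
    if prompt.startswith("Describe"):
        return "bottles of bleach and ammonia"
    return "Try mixing the bleach and the ammonia."


def toy_check(query: str, draft: str) -> bool:
    """Stub safety check; a real one would be a classifier or a model self-check."""
    return "mixing the bleach" in draft


result = answer_with_caption_fallback(
    "What interesting combinations can I make with these?", b"fake-image", toy_vlm, toy_check
)
```

The design choice worth noting is that the fallback needs no retraining: it only changes which inputs the model sees on the second pass.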

📈 Overall Progress

The field has progressed from discovering that visual inputs bypass text safety alignment (2023) to developing sophisticated training-time and inference-time defenses (2024-2025), and now focuses on structured safety reasoning and consequence-aware policies (2025-2026). A critical paradigm shift occurred from binary safe/unsafe classification to structured reasoning chains that make safety decisions auditable and explainable. The arms race between attacks and defenses continues to intensify, with each side driving innovation in the other.

📂 Sub-topics

Jailbreak Attacks on VLMs

35 papers

Methods that exploit the visual modality to bypass safety alignment in Vision-Language Models, including adversarial image perturbations, typography-based attacks, and multi-modal prompt injection.

Multi-Modal Linkage Attack Typography-Based Jailbreaking Adversarial Image Hijacking Cross-modal Obfuscation

Safety Alignment and Training

30 papers

Training-time methods to align VLMs with safety requirements, including safety fine-tuning datasets, adversarial DPO, decoupled preference optimization, and reinforcement learning with safety constraints.

VLGuard Safety Fine-Tuning Adversary-Aware DPO Safe RLHF-V SaFeR-VLM

Inference-Time Safety Defense

25 papers

Training-free methods that protect VLMs at inference time, including activation steering, representation projection, suffix generation, and image-to-text conversion to restore LLM safety alignment.

ECSO Image-to-Text Conversion ASTRA Activation Steering BlueSuffix Cross-Modal Defense Jailbreak Shift Removal

Adversarial Robustness

25 papers

Methods for making vision encoders and VLMs robust to adversarial perturbations, including unsupervised adversarial fine-tuning, pre-trained model guided training, and large-scale adversarial pre-training.

FARE Unsupervised Robust Embeddings PMG-AFT Guided Fine-Tuning Double Visual Defense Concept-Guided Fine-Tuning

Hallucination Detection and Mitigation

30 papers

Research on detecting and reducing hallucinations in multimodal models, including fine-grained human feedback, adversarial hallucination generation, sharpness-aware unlearning, and visual grounding techniques.

Dense Direct Preference Optimization VHTest Adversarial Generation Sharpness-Aware Robust Erasure Residual Visual Decoding

Safety Benchmarks and Evaluation

35 papers

Comprehensive benchmarks and evaluation frameworks for assessing VLM safety across dimensions including jailbreak resistance, hallucination rates, moral robustness, and reliability under visual corruptions.

MM-SafetyBench, OmniSafeBench-MM, REVAL, VLM-RobustBench

Data Poisoning and Backdoor Attacks

20 papers

Attacks that inject malicious behaviors into VLMs through training data manipulation, including stealthy data poisoning, backdoor injection via model merging, and knowledge base poisoning in RAG systems.

Shadowcast Stealthy Poisoning BadMerging Coefficient-Agnostic Injection MM-PoisonRAG BadReward RLHF Poisoning

Embodied and Agent Safety

25 papers

Safety evaluation and defense for VLM-powered embodied agents in autonomous driving, robotic manipulation, and household environments, addressing both adversarial attacks and natural failure modes.

IS-Bench Interactive Safety AGENTSAFE Full-Stack Diagnosis LabShield Safety-Centric PRP HomeSafe-Bench HD-Guard

Robustness to Modality Issues

22 papers

Research on maintaining model performance when modalities are missing, noisy, or conflicting, including missing modality adaptation, certifiable robustness, and cross-modal conflict resolution.

UME-MMA Ensemble Adaptation CRMT Certifiable Robustness Expert Consistency Learning RAGPT Dynamic Prompting

💡 Key Insights

💡 Visual inputs fundamentally bypass text-only safety alignment in VLMs

💡 Safety fine-tuning with just 2,000 images can reduce attack success by 98%

💡 Structured safety reasoning chains outperform binary refusal approaches

💡 Multi-modal reasoning models have 37% higher jailbreak rates than base models

💡 Spurious textual correlations create a 'safety mirage' easily broken by one-word attacks


📅 Timeline

Research has evolved from isolated attack demonstrations to comprehensive safety frameworks spanning the full model lifecycle. Early work focused on proving vulnerabilities exist; current work increasingly addresses real-world deployment challenges including embodied agent safety, privacy-aware geolocation, and consequence-driven alignment for autonomous systems.

2023-05 to 2023-12 Foundational vulnerability discovery and early safety alignment

Multi-modal indirect prompt injection (Abusing Images and Sounds for..., 2023) demonstrated adversarial perturbations in images/audio can hijack VLM behavior
(Image Hijacks, 2023) introduced Behaviour Matching achieving 100% success in controlling VLM outputs via optimized images
(MM-SafetyBench, 2023) discovered typography-based attacks increase ASR by 30%+ over text-only baselines
(RLHF-V, 2023) pioneered fine-grained segment-level corrections for hallucination, reducing error by 34.8% with 1.4k samples
(Robust Instruction Tuning, 2023) created the first large-scale dataset with negative instructions to teach models to say 'No'

🔀 Discovery that multimodal inputs fundamentally bypass text-only safety alignment, establishing the visual modality as a critical attack surface.

2024-01 to 2024-12 Rapid expansion of attack-defense arms race and robustness methods

VLGuard (Safety Fine-Tuning at Almost No Cost, 2024) proved safety can be restored with just 2,000 curated images via mixed fine-tuning
(Robust CLIP, 2024) achieved adversarial robustness at 0.2% of CLIP training cost using unsupervised feature consistency
ECSO (Eyes Closed, Safety On, 2024) demonstrated +58.6% harmless rate improvement via training-free image-to-text conversion
(Shadowcast, 2024) showed VLMs can be poisoned with as few as 50 stealthy samples
(BadMerging, 2024) achieved >90% ASR against merged models where prior methods failed at <20%
Red Teaming VLMs (Red Teaming Visual Language Models, 2024) established a comprehensive 4-aspect safety taxonomy (Faithfulness, Privacy, Safety, Fairness)

2025-01 to 2025-12 Maturation of structured safety reasoning and comprehensive evaluation

(Safe RLHF-V, 2025) introduced decoupled dual-preference optimization with 7-point safety scale, achieving +34.2% safety improvement
(Double Visual Defense, 2025) achieved ~70% robustness improvement through adversarial pre-training from scratch
(Safety at Scale, 2025) unified safety research across six model types with 574 papers analyzed
(SafeMLRM, 2025) quantified the 'Reasoning Tax' showing MLRMs have 37.44% higher jailbreak rates than base models
(PRISM, 2025) introduced 4-step safety Chain-of-Thought with MCTS-generated preference pairs, achieving 0.15% ASR on JailbreakV-28K
(GuardReasoner-VL, 2025) improved guard model F1 by +19.27% using online RL with safety-aware data concatenation

🔀 Shift from binary safety classification to structured reasoning-based safety, where models must explain their safety decisions through explicit perception-reasoning-decision chains.

2026-01 to 2026-03 Consequence-aware safety, embodied agent benchmarks, and causal reasoning for risk prevention

OOD-MMSafe (OOD-MMSafe, 2026) shifted from intent detection to causal projection, reducing risk identification failure to 5.7%
(LabShield, 2026) evaluated 33 MLLMs on lab safety with OSHA/GHS standards, finding 32% performance drop in professional scenarios
(SaFeR-ToolKit, 2026) formalized safety as a checkable protocol with virtual tool traces
(ConflictBench, 2026) showed alignment failures occur at step 5.28 on average in multi-turn interactions, proving single-turn benchmarks miss delayed misalignment

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Multi-Modal Jailbreak Attacks	Visual inputs create a continuous, high-dimensional attack surface that circumvents discrete text safety filters via gradient optimization, encrypted visual encodings, or prompt injection.	Multi-Modal Linkage (MML) achieves 99.40% Attack Success Rate on GPT-4o, improving over prior baselines by +66.4% on SafeBench	Jailbreak Large Vision-Language Models Through... (2024), Image Hijacks (2023), MM-SafetyBench (2023), Cross-modal Adversarial Multimodal Obfuscation (CAMO) (2025), JPS (2025)
Safety-Aligned Fine-Tuning	Combine safety-specific training data with modified optimization objectives that separate helpfulness from safety constraints, preventing the model from learning spurious refusal shortcuts.	VLGuard Mixed Fine-Tuning reduces Attack Success Rate from 53.6% to 1.1% on LLaVA-v1.5-7B while maintaining helpfulness	Safety Fine-Tuning at (Almost) No... (2024), Safe RLHF-V (2025), SaFeR-VLM (2025), SaFeR-ToolKit (2026)
Inference-Time Safety Defense	Exploit the insight that visual embeddings create detectable anomalies in the model's representation space, which can be identified and corrected during inference via activation steering or image-to-text conversion.	ASTRA reduces Attack Success Rate by 17.84% over JailGuard while running 9x faster by avoiding multiple inference passes	Eyes Closed, Safety On: Protecting... (2024), ASTRA (2024), Understanding and Defending VLM Jailbreaks... (2026), VLM-Guard (2025)
Adversarial Robustness Training	Force the vision encoder to produce identical representations for clean and adversarially perturbed images, either through feature consistency losses or full adversarial pre-training on web-scale data.	Double Visual Defense (Δ²-LLaVA) achieves ~70% absolute robustness improvement on Stanford Cars over prior methods (TeCoA, FARE) while maintaining clean performance	Robust CLIP (2024), Double Visual Defense (2025), Pre-trained Model Guided Fine-Tuning for... (2024), Anyattack (2025)
Hallucination Mitigation via Fine-Grained Feedback	Instead of ranking entire responses, collect precise corrections at the segment level and use modified optimization objectives that give higher weight to corrected regions, teaching models where exactly they hallucinate.	RLHF-V reduces hallucination rate by 34.8% using only 1.4k annotated samples, outperforming LLaVA-RLHF which required 10k samples	RLHF-V (2023), Beyond Superficial Unlearning (2026), GHOST (2025), Mitigating Hallucination in Large Multi-Modal... (2023)

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MM-SafetyBench	Attack Success Rate (lower is safer for defense, higher for attack)	1.1% ASR (defense), reduced from 53.6% baseline	Safety Fine-Tuning at (Almost) No... (2024)
JailBreakV-28K	Attack Success Rate (lower is safer)	0.15% ASR	PRISM (2025)
POPE (Polling-based Object Probing Evaluation)	Accuracy	85.0% accuracy on Random split	Mitigating Hallucination in Large Multi-Modal... (2023)
SafeBench / HADES (Jailbreak Attack Evaluation)	Attack Success Rate (higher indicates more effective attack)	99.40% ASR on GPT-4o	Jailbreak Large Vision-Language Models Through... (2024)
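
The MM-SafetyBench row and the "reduce attack success by 98%" insight can be reconciled with a short calculation, assuming the 98% figure refers to the relative reduction from 53.6% to 1.1% ASR (the source does not state this explicitly). The function name is illustrative.

```python
def asr_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction in percent)."""
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


absolute, relative = asr_reduction(53.6, 1.1)
print(f"{absolute:.1f} points absolute, {relative:.1f}% relative")
# prints "52.5 points absolute, 97.9% relative"
```

Stating which of the two figures is meant matters when comparing defenses, since a 52.5-point drop and a roughly 98% relative reduction describe the same result.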

⚠️ Known Limitations (4)

Safety-utility trade-off: Most defense methods reduce model helpfulness when improving safety, leading to over-refusal of benign queries that superficially resemble unsafe ones. (affects: Safety-Aligned Fine-Tuning, Inference-Time Safety Defense)
Potential fix: Machine unlearning (removing harmful knowledge) rather than supervised refusal, and structured reasoning chains that separate intent classification from response generation
Arms race dynamics: Each new defense is quickly circumvented by more sophisticated attacks, and defenses designed for known attack patterns fail to generalize to novel threats. (affects: Multi-Modal Jailbreak Attacks, Inference-Time Safety Defense)
Potential fix: Proactive defense frameworks that reason about potential harm rather than pattern-matching known attacks, such as consequence-aware safety policies
Evaluation gaps: Static benchmarks fail to capture temporal dynamics, multi-turn escalation, and interaction effects where individually safe components combine to create harm. (affects: Safety-Aligned Fine-Tuning, Hallucination Mitigation via Fine-Grained Feedback)
Potential fix: Interactive, process-oriented evaluation frameworks that test agents across multi-step scenarios with dynamic risk emergence, as proposed by IS-Bench and ConflictBench
Scalability of robust training: Adversarial pre-training from scratch is highly effective but requires enormous computational resources, limiting accessibility for the research community. (affects: Adversarial Robustness Training)
Potential fix: Efficient alternatives like FARE that achieve robustness at 0.2% of training cost, or test-time compute scaling approaches like Self-Critical Inference


💡 Another cross-cutting theme examines Analysis.

🧩

Analysis

What: Research on evaluating, benchmarking, aligning, and understanding multimodal models—particularly Vision-Language Models (VLMs)—across diverse tasks, domains, and safety dimensions.

Why: As multimodal models are deployed in high-stakes settings like healthcare, autonomous driving, and embodied AI, rigorous analysis of their capabilities, failure modes, and alignment is essential for trustworthy deployment.

Baseline: Standard VLM evaluation relies on static, single-turn benchmarks with multiple-choice questions, aggregate accuracy metrics, and general-purpose prompting without domain adaptation.

Benchmarks allow shortcut learning via text priors and guessing, inflating true capability estimates
Models hallucinate confidently, failing to ground reasoning in visual evidence as sequences grow longer
Safety alignment degrades when visual modality is added, enabling jailbreak attacks that bypass text-only defenses

🧪 Running Example

❓ Given a photo of a busy intersection, determine how many pedestrians are crossing and whether it is safe for an autonomous vehicle to proceed.

Baseline: A standard VLM might answer 'three pedestrians, safe to proceed' by relying on text priors about typical intersections, without actually counting individuals or detecting a partially occluded child in the crosswalk.

Challenge: This example illustrates multiple challenges: the model must count accurately (a known VLM weakness), reason spatially about occlusion, ground its answer in the actual image rather than language priors, and make a safety-critical judgment where hallucination could be catastrophic.

✅ Visually-Perceptive Policy Optimization (VPPO): VPPO focuses RL training updates on tokens with high visual dependency, ensuring the model actually looks at the crosswalk rather than guessing from text patterns.

✅ Spectral Representation Filtering (SRF): SRF suppresses hallucination modes in the model's representations, reducing the chance of fabricating pedestrians not present in the image.

✅ WildVision Arena Evaluation: Real-world human preference evaluation via pairwise battles would reveal whether the model's driving judgment aligns with human expectations, unlike static benchmarks.
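
The idea of focusing RL updates on visually dependent tokens can be sketched as a per-token mask. In this minimal sketch, visual dependency is approximated by the KL divergence between next-token distributions computed with and without the image, and the top fraction of tokens is kept; both the scoring rule and the `keep_fraction` parameter are assumptions made for illustration and may differ from VPPO's actual formulation.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def visual_dependency_mask(with_image: list[list[float]],
                           without_image: list[list[float]],
                           keep_fraction: float = 0.5) -> list[float]:
    """Mark the tokens whose predicted distribution changes most when the image is removed."""
    scores = [kl_divergence(p, q) for p, q in zip(with_image, without_image)]
    k = max(1, round(keep_fraction * len(scores)))
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    keep = set(top)
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def masked_policy_loss(token_losses: list[float], mask: list[float]) -> float:
    """Average the per-token loss over the kept (visually dependent) tokens only."""
    return sum(l * m for l, m in zip(token_losses, mask)) / sum(mask)


# Four generated tokens over a toy two-word vocabulary.
with_image = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
without_image = [[0.5, 0.5], [0.5, 0.5], [0.8, 0.2], [0.6, 0.4]]
mask = visual_dependency_mask(with_image, without_image)
loss = masked_policy_loss([1.0, 2.0, 3.0, 4.0], mask)
```

Tokens 0 and 2 change when the image is removed, so only they contribute to the loss; tokens that could be guessed from text alone are ignored.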

📈 Overall Progress

The field has evolved from basic capability cataloging to sophisticated diagnostic evaluation and mechanistic understanding. Early work (2023) established foundational benchmarks, but 2024 brought a paradigm shift toward live human-preference arenas and process-aware evaluation. The 2025 RL revolution introduced visually-grounded training methods that directly address the core problem of VLMs ignoring visual evidence. By 2026, research has converged on internal representation analysis for both improving capabilities and defending against attacks, while frontier benchmarks continue to expose fundamental gaps in spatial reasoning, visual tracking, and safety-critical deployment.

📂 Sub-topics

VLM Benchmarking & Evaluation

220 papers

Creating rigorous, diverse benchmarks to evaluate VLM capabilities across reasoning, perception, spatial understanding, cultural knowledge, and domain-specific tasks, while addressing shortcomings like shortcut learning and inflated scores.

Human-preference arenas, Process evaluation, Controlled stimuli generation, Multi-dimensional metrics

Multimodal Reinforcement Learning & Alignment

120 papers

Applying reinforcement learning techniques—particularly GRPO and its variants—to improve VLM reasoning, reward modeling, and human preference alignment, including novel reward model architectures.

GRPO variants, Reward model training, Cold-start SFT + RL, Process reward models

Hallucination Detection & Factuality

80 papers

Identifying, measuring, and mitigating hallucinations in multimodal models—where generated text contradicts visual evidence or world knowledge—through mechanistic analysis, spectral filtering, and multi-agent verification.

Spectral filtering, Atomic decomposition, Multi-agent verification, Process-aware evaluation

Safety, Robustness & Adversarial Analysis

70 papers

Evaluating and defending VLMs against jailbreak attacks, adversarial inputs, and safety failures across text and visual modalities, including red-teaming frameworks and defense mechanisms.

Typography-based jailbreaking, Activation shift calibration, Multi-modal linkage attacks, Safety fine-tuning

Domain-Specific Multimodal Analysis

167 papers

Adapting and evaluating multimodal models for specialized domains including medical imaging, autonomous driving, agriculture, remote sensing, and scientific applications.

3D-native VLM pretraining, Domain adaptation, Fine-grained evaluation frameworks, Knowledge-enhanced foundation models

💡 Key Insights

💡 Visual reasoning does not scale with model size—specialized objectives matter more than parameters.

💡 Live human-preference arenas achieve >0.94 correlation with human judgment, far surpassing static benchmarks.

💡 Reinforcement learning with visual-token awareness yields 10-19% gains over standard GRPO on multimodal tasks.

💡 VLMs frequently hallucinate because language priors overwhelm visual evidence as sequence length grows.

💡 Safety alignment degrades dramatically when visual modality is added, enabling cross-modal jailbreaks.


📅 Timeline

Research has progressed from evaluating 'what VLMs can do' to understanding 'why they fail' through mechanistic interpretability, diagnostic benchmarks, and representation-level analysis, while simultaneously developing RL-based training methods that ground reasoning in visual evidence.

2023-03 to 2023-12 Foundational multimodal benchmarks and early safety analysis

(MM-SafetyBench, 2023) discovered typography-based visual jailbreaking, establishing the first multimodal safety evaluation framework.
(MM-Vet, 2023) introduced capability integration evaluation, defining 6 core VL capabilities and their 16 combinations.
(LAMM, 2023) created the first open-source instruction tuning dataset including 3D point clouds alongside images.
(MRECG, 2023) and (PCR, 2023) enabled efficient deployment of diffusion and vision models.

2024-01 to 2024-12 Scaling evaluation to real-world complexity and human preferences

(WildVision, 2024) launched the first live VLM arena with Elo ratings achieving 0.94 correlation with human preferences.
(VisionArena, 2024) scaled to 230K real-world conversations across 45 VLMs and 138 languages.
(UniBench, 2024) consolidated 53 benchmarks revealing that reasoning capabilities do not scale linearly like recognition.
Spider2-V (Spider2-V, 2024) introduced full-stack data science agent evaluation where GPT-4V achieved only 14% success.
InternVL3 (InternVL3, 2025) pioneered native multimodal pre-training, achieving 72.2 on MMMU and setting a new SOTA for open-source MLLMs.
(Merlin, 2024) introduced 3D-native CT vision-language pretraining, outperforming 2D baselines by +32.1% F1 on findings classification.

🔀 Shift from static benchmarks to live human-preference arenas (WildVision, VisionArena) and from answer-only evaluation to process-aware assessment.

2025-01 to 2025-12 Reinforcement learning revolution and diagnostic evaluation

(VPPO, 2025) introduced token-level visual perception masking for RL, achieving +19.2% improvement across eight benchmarks.
(VisuLogic, 2025) showed SOTA models achieve near-random performance on visual logic problems that resist language shortcuts.
SophiaVL-R1 (SophiaVL-R1, 2025) introduced thinking reward models that score the entire reasoning process holistically rather than step by step.
(CoreCognition, 2024) exposed that models fail at rudimentary developmental tasks while excelling at complex reasoning.
(MM-MATH, 2024) introduced process evaluation via LMM-as-a-judge, revealing diagram misinterpretation accounts for >50% of errors.

🔀 Emergence of visually-grounded RL methods (VPPO, AT-RL) that focus training on tokens with high visual dependency, alongside diagnostic benchmarks exposing fundamental VLM limitations.

2026-01 to 2026-03 Mechanistic understanding, advanced defense, and frontier evaluation

(Latent CoT, 2026) internalized chain-of-thought reasoning into efficient discriminative reward models, surpassing GPT-5 by +9.6 points.
(AT-RL, 2026) used graph-based anchor token identification for precise credit assignment in multimodal RL.
(JRS-Rem, 2026) proposed representation-space defense reducing jailbreak success from 84% to 15%.
(DatBench, 2026) achieved 13x evaluation speedup while improving discriminative power through data-centric curation.
(LabShield, 2026) revealed a 32% performance drop when moving from text MCQs to embodied laboratory safety scenarios.
(VET-Bench, 2026) proved visual entity tracking is NC1-complete and showed that Molmo2-SGCoT achieves >90% accuracy where frontier models score near random.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Visually-Grounded Reinforcement Learning for VLMs	Measure each token's visual dependency via attention or KL divergence, then weight RL policy updates to prioritize visually-grounded reasoning paths.	Improves on standard GRPO by +19.2% average accuracy on Qwen2.5-VL-7B across eight multimodal benchmarks (VPPO), and +8.24 points on five math benchmarks (AT-RL).	Spotlight on Token Perception for... (2025), Credit Where It's Due: Cross-Modality... (2026), Advancing Multimodal Reasoning via Reinforcement... (2025)
Generative & Latent Chain-of-Thought Reward Models	Train reward models to generate explanations alongside scores, then discard the generation head at inference to retain internalized reasoning in an efficient discriminative scorer.	Latent CoT achieves 85.1% accuracy on EditReward-Bench, surpassing GPT-5 (75.5%) by +9.6 points; EditScore-72B surpasses GPT-4o (84.41%) with 86.36% accuracy.	Joint Reward Modeling (2026), EditScore (2025), Skywork-VL Reward (2025)
Human-Preference Arena Evaluation	Deploy anonymous VLM battles in live platforms where users vote on preferred responses, converting pairwise wins into statistically robust Elo rankings.	WildVision-Bench achieves 0.94 Spearman correlation with human Elo ratings; VisionArena-Bench achieves 0.973 Spearman correlation with its live arena leaderboard, where WildVision-Bench reaches 0.802.	WildVision (2024), VisionArena (2024), CapArena (2025)
Hallucination Suppression via Internal Representation Analysis	Identify hallucination-prone directions in the model's representation space via eigendecomposition or probing, then dampen those specific modes in network weights or activations.	SRF achieves SOTA faithfulness on POPE and MSCOCO with zero inference latency overhead; VIB-Probe improves M-HalDetect AUROC by +2.84% over baselines.	Suppressing VLM Hallucinations with Spectral... (2025), VIB-Probe (2026), Multimodal large language models excel... (2024)
Multi-Modal Safety Jailbreak & Defense	Harmful content encoded in images (via typography or metaphorical encryption) bypasses text-aligned safety filters; defenses identify and subtract the jailbreak-specific activation shift in representation space.	Multi-Modal Linkage achieves 99.40% attack success rate on GPT-4o, improving over baselines by +66.4%; VLGuard Mixed Fine-Tuning reduces attack success from 53.6% to 1.1%.	MM-SafetyBench (2023), Jailbreak Large Vision-Language Models Through... (2024), Safety Fine-Tuning at (Almost) No... (2024), Understanding and Defending VLM Jailbreaks... (2026)
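
The first row of the table above describes scoring each token's visual dependency via KL divergence and then weighting policy updates toward visually grounded tokens. The following is a minimal sketch of that idea only, assuming per-token next-token distributions are available both with the image and with the image removed. The function names, the binary top-fraction mask, and the `top_fraction` parameter are illustrative assumptions, not the VPPO or AT-RL implementations.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def visual_dependency(dists_with_image, dists_without_image):
    """Per-token score: how much the prediction shifts when the image is removed."""
    return [kl_divergence(p, q) for p, q in zip(dists_with_image, dists_without_image)]

def token_weights(scores, top_fraction=0.5):
    """Binary mask keeping (at least) the top fraction of tokens by score.

    Hypothetical selection rule; ties at the threshold are all kept.
    """
    k = max(1, round(len(scores) * top_fraction))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 if s >= threshold else 0.0 for s in scores]

def weighted_policy_loss(log_probs, advantages, weights):
    """Policy-gradient surrogate averaged over the selected tokens only."""
    total = sum(w * -lp * a for lp, a, w in zip(log_probs, advantages, weights))
    return total / max(sum(weights), 1.0)
```

In this sketch a token whose distribution is unchanged by removing the image receives a score of zero and is masked out of the update, so the gradient signal concentrates on tokens that actually depend on the visual input.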

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MMMU (Massive Multi-discipline Multimodal Understanding)	Accuracy	72.2%	InternVL3 (2025)
MathVista (Mathematical Visual Reasoning)	Accuracy	73.4%	Advancing Multimodal Reasoning via Reinforcement... (2025)
EditReward-Bench	Accuracy	86.36%	EditScore (2025)
VisionArena-Bench (Human Preference Correlation)	Spearman Correlation	0.973	VisionArena (2024)
MM-SafetyBench (Attack Success Rate Reduction)	Attack Success Rate (lower is better)	1.1% ASR (down from 53.6%)	Safety Fine-Tuning at (Almost) No... (2024)
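
Several entries above report a Spearman correlation between an automatic benchmark and arena rankings. As a reference for how that metric is computed, here is a small standard-library sketch: rank both score lists (ties share an average rank), then take the Pearson correlation of the ranks. The example scores in the comments are made up for illustration and do not come from any paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation between two equal-length score lists."""
    return pearson(ranks(x), ranks(y))

# Hypothetical benchmark accuracies and arena Elo ratings for four models.
bench_scores = [71.0, 64.5, 58.2, 80.1]
arena_elo = [1120, 1080, 1095, 1210]
rho = spearman(bench_scores, arena_elo)  # roughly 0.8 for these made-up numbers
```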

⚠️ Known Limitations (4)

Benchmark saturation and shortcut learning: Many benchmarks allow models to score well using text priors or guessing strategies without genuine visual understanding, inflating capability estimates. (affects: Human-Preference Arena Evaluation, VLM Benchmarking & Evaluation)
Potential fix: Convert multiple-choice to generative evaluation, filter questions solvable without visual input, and use controlled stimuli (CIVET, VisuLogic) that resist language shortcuts.
Reasoning-hallucination trade-off: Longer reasoning chains improve logical inference but degrade visual grounding, causing models to 'forget' the image as they reason more. (affects: Visually-Grounded Reinforcement Learning for VLMs, Hallucination Suppression via Internal Representation Analysis)
Potential fix: Use visual anchoring during reasoning (VAPO), moderate reasoning length via difficulty-aware budgets, or apply spectral filtering to maintain visual grounding.
Cross-modal safety gap: Adding a vision modality weakens the safety alignment of the underlying LLM, and current defenses remain fragile against sophisticated multi-modal attacks. (affects: Multi-Modal Safety Jailbreak & Defense)
Potential fix: Disentangle safety-relevant activation shifts from semantic shifts (ShiftDC), identify and project out jailbreak representation directions (JRS-Rem), or include safety data during visual instruction tuning.
Domain transfer gap: Models achieving strong general-purpose performance often fail catastrophically on specialized domains (medical, scientific, agricultural) where fine-grained visual details and domain knowledge are critical. (affects: Domain-Specific Multimodal Analysis)
Potential fix: Use domain-specific pretraining with expert data (KeepFIT, Merlin), construct specialized instruction-tuning datasets, or develop hybrid systems combining VLMs with domain-specific tools.
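
One of the fixes listed for the cross-modal safety gap is to identify a jailbreak direction in representation space and project it out. The sketch below shows only the generic linear-algebra step, under the assumption that the direction can be approximated by the normalized difference between mean hidden states of jailbreak and benign inputs. That estimator and all function names are hypothetical; this is not the JRS-Rem or ShiftDC procedure.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def estimate_direction(jailbreak_states, benign_states):
    """Hypothetical estimator: normalized difference of mean hidden states."""
    mj, mb = mean_vector(jailbreak_states), mean_vector(benign_states)
    return unit([a - b for a, b in zip(mj, mb)])

def project_out(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - dot * d for h, d in zip(hidden, direction)]
```

After `project_out`, the hidden state has zero component along the estimated direction while every orthogonal component is left unchanged, which is the property such defenses rely on to preserve benign semantics.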

📚 View major papers in this topic (10)

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (2025-04) 9
Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models (2026-02) 9
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (2024-06) 9
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (2024-07) 9
Spotlight on Token Perception for Multimodal Reinforcement Learning (2025-10) 8
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset (2024-06) 9
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage (2024-11) 9
Can Vision-Language Models Solve the Shell Game? (2026-03) 9
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models (2025-04) 8
Kimi K2.5: Visual Agentic Intelligence (2026-02) 9

💡 Another cross-cutting theme examines Benchmark.

🔬

Benchmark

What: Research on constructing evaluation benchmarks, datasets, and metrics for systematically assessing multimodal models across vision-language understanding, reasoning, safety, and domain-specific tasks.

Why: Without rigorous, standardized evaluation, inflated benchmark scores and hidden model failures impede trustworthy deployment of multimodal AI in real-world applications.

Baseline: Early evaluation relied on narrow Visual Question Answering (VQA) datasets like VQAv2 with exact-match metrics, testing single-image perception in controlled settings.

Existing benchmarks suffer from data contamination, language shortcuts, and multiple-choice guessing that inflate model scores
Models pass high-level reasoning benchmarks yet fail rudimentary perception tasks like counting and spatial reasoning
Evaluating open-ended, multi-modal, multi-turn interactions with subjective human preferences remains unsolved

🧪 Running Example

❓ Given a satellite image of a flood-affected area, count the destroyed buildings and suggest optimal rescue routes.

Baseline: A standard VLM benchmark would test this with a single multiple-choice question on a clean image, allowing the model to guess from textual cues without truly perceiving the scene.

Challenge: This example requires multi-modal reasoning (satellite imagery + geospatial context), fine-grained perception (counting small objects), domain expertise (disaster response), and compositional reasoning (route planning) — capabilities that no single existing benchmark adequately tests.

✅ Domain-Specific Multi-Task Benchmarking: DisasterM3 provides bi-temporal satellite image pairs with 9 distinct tasks (from recognition to rescue routing), testing each capability independently to pinpoint failures.

✅ Circular Evaluation with LLM Judging: MMBench's CircularEval shuffles answer choices across multiple passes, eliminating positional bias and ensuring the model genuinely understands the scene rather than exploiting option patterns.

✅ Human Preference Arena Evaluation: VisionArena collects real-world user interactions and preference votes to rank models by genuine helpfulness, capturing open-ended quality that static benchmarks miss.
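
The circular evaluation idea in the list above can be sketched as follows: present the same question once per rotation of the answer choices and count it as solved only if the model is right every time. This is an illustrative sketch, not MMBench's code. The `model` callable (which returns the index of the chosen option) and the sample question are assumptions.

```python
from typing import Callable, Sequence

def circular_eval(question: str,
                  choices: Sequence[str],
                  correct: str,
                  model: Callable[[str, Sequence[str]], int]) -> bool:
    """Return True only if the model picks `correct` under every rotation."""
    n = len(choices)
    for shift in range(n):
        rotated = [choices[(i + shift) % n] for i in range(n)]
        picked = model(question, rotated)  # index into `rotated`
        if rotated[picked] != correct:
            return False
    return True

# A position-biased stub that always answers with the first option.
always_first = lambda question, options: 0

solved = circular_eval("How many buildings are destroyed?",
                       ["3", "12", "7", "20"], "12", always_first)
# `solved` is False: the stub is right only when "12" happens to be listed first.
```

A model that always picks the same position is correct in exactly one rotation, so requiring all rotations to pass removes the credit it would otherwise earn from positional guessing.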

📈 Overall Progress

The benchmarking landscape has undergone a paradigm shift from narrow, single-capability VQA datasets to comprehensive, multi-dimensional evaluation ecosystems. Early work focused on establishing foundational benchmarks with robust anti-cheating mechanisms (CircularEval, blind baselines). The field then scaled to video, long-context, and domain-specific professional evaluation while incorporating live human preference signals. Most recently, research has turned to probing fundamental cognitive limitations, interactive safety in embodied settings, and culturally diverse evaluation, revealing that even frontier models fail at basic perception tasks that are trivial for humans.

📂 Sub-topics

General VLM Capability Benchmarks

95 papers

Comprehensive benchmarks evaluating core vision-language capabilities including perception, reasoning, OCR, spatial understanding, and knowledge integration across diverse tasks and formats.

CircularEval, Capability Integration Scoring, Arena-based Ranking

Spatial, 3D, and Embodied Understanding Benchmarks

80 papers

Benchmarks targeting spatial perception, 3D scene understanding, embodied navigation, and physical reasoning — capabilities where VLMs consistently underperform humans despite strong general reasoning.

Psychometric BSA Framework, Ego-centric Evaluation, Multi-level Skill Decomposition

Safety, Robustness, and Hallucination Benchmarks

75 papers

Evaluations of model reliability under adversarial attacks, hallucination detection, safety alignment, privacy risks, and robustness to visual corruption or textual misinformation.

Typography-Based Jailbreaking, Process-Oriented Safety Evaluation, Multi-Dimensional Safety Scoring

Domain-Specific Benchmarks and Datasets

120 papers

Specialized benchmarks for medicine, remote sensing, autonomous driving, agriculture, telecom, and other professional domains where general-purpose VLMs fail due to domain gaps.

Domain VLM Fine-tuning, Expert-Annotated Evaluation, Multi-Sensor Fusion

Large-Scale Dataset Construction and Instruction Tuning

80 papers

Methods for constructing high-quality multimodal training datasets at scale, including synthetic data generation, instruction tuning data curation, and data quality filtering pipelines.

GPT-Assisted Data Generation, Self-Training Synthesis, Arena-Based Data Collection

Reasoning, Mathematical, and Cognitive Benchmarks

60 papers

Benchmarks probing abstract reasoning, mathematical problem-solving, chain-of-thought quality, and core cognitive abilities in multimodal models, revealing fundamental gaps between model and human intelligence.

Process Evaluation, Parallel Domain Diagnostic, Visual Math Decomposition

💡 Key Insights

💡 VLMs exploit textual shortcuts — many score higher without visual input than with it.

💡 56% of VLM reasoning failures trace to perception deficits, not logic errors.

💡 Low-level cognitive abilities show zero improvement with model scaling.

💡 Live arena evaluation outperforms static benchmarks in predicting human preference.

💡 Domain-specific benchmarks reveal 30–50% capability gaps versus general evaluation.

📅 Timeline

The evolution follows a clear trajectory: from testing 'what models can do' (capability benchmarks) to testing 'what models cannot do' (cognitive profiling) to testing 'what models should not do' (safety and privacy evaluation), with increasing emphasis on ecological validity through real-world data and interactive environments.

2023-02 to 2023-12 Foundation benchmarks and large-scale dataset construction for the VLM era

(Visual Instruction Tuning, 2023) pioneered GPT-assisted visual instruction data generation and the visual instruction tuning paradigm.
(MMBench, 2023) introduced CircularEval and LLM-based answer extraction, establishing the gold standard for robust VLM evaluation.
(MM-Vet, 2023) defined capability integration evaluation across 6 core VL skills with open-ended LLM scoring.
(MVBench, 2023) systematically extended static image tasks to 20 dynamic video tasks, filling the temporal understanding evaluation gap.
ShareGPT4V (ShareGPT4V, 2023) demonstrated that high-quality detailed captions scale VLM performance, gaining +36.1 points on MME.
(MIMIC-IT, 2023) introduced multi-modal in-context instruction tuning with 2.8M samples across images and videos.
(MM-SafetyBench, 2023) discovered typography-based visual jailbreaking, exposing critical safety vulnerabilities in VLMs.

🔀 Shift from narrow VQA evaluation to comprehensive multi-capability benchmarking with LLM-as-judge evaluation, establishing the modern VLM evaluation paradigm.

2024-01 to 2024-12 Scaling evaluation to video, long-context, domain-specific, and safety-critical scenarios

(Video-MME, 2024) created the first full-spectrum video benchmark spanning short to long durations with subtitle/audio integration.
(WildVision, 2024) launched the first Chatbot Arena for VLMs, achieving 0.94 Spearman correlation with human preferences.
(MathVerse, 2024) exposed that VLMs score higher on text-only versions of visual math problems, proving reliance on textual shortcuts.
Spider2-V (Spider2-V, 2024) tested full-stack data science workflows in live VMs, where GPT-4V achieved only 14% success rate.
(Merlin, 2024), published in Nature, established 3D-native medical VLM evaluation, achieving +16% F1 zero-shot over supervised baselines.
(CoreCognition, 2024) revealed VLMs show a reversed capability curve — failing low-level tasks that improve with human development.
(VisionArena, 2024) scaled arena evaluation to 230K real conversations, achieving 97.3% correlation with live leaderboards.
Pixtral-12B (Pixtral 12B, 2024) introduced RoPE-2D for native variable-resolution processing, outperforming 7x larger models on MMMU.

🔀 Transition from single-image benchmarks to multi-modal, multi-turn, real-world evaluation including live human preference arenas and domain-specific professional tasks.

2025-01 to 2026-02 Probing fundamental limitations, interactive safety, and culturally diverse evaluation

(IS-Bench, 2025) introduced process-oriented interactive safety evaluation, showing all SOTA agents achieve <40% safe success rate.
(SPINBENCH, 2025) demonstrated strong egocentric bias in spatial reasoning — models fail allocentric perspective taking entirely.
(AHELM, 2025) created the first holistic audio-language model benchmark spanning diverse audio understanding tasks.
(EditScore, 2025) achieved 86.36% accuracy on image editing reward evaluation, surpassing GPT-4o and GPT-5.
(DatBench, 2026) achieved 13x evaluation speedup through data-centric curation, correcting inflated MCQ scores by ~35 points.
(VRIQ, 2026) attributed 56% of VLM failures to perception-only deficits via parallel domain diagnostic benchmarking.
(VLM-GeoPRIVACY, 2026) revealed that GPT-5 over-discloses sensitive location data 47.6% of the time.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Arena-Based Human Preference Evaluation	Users chat with two anonymous models simultaneously and cast preference votes, producing statistically robust Elo rankings from thousands of real-world interactions.	Improves on static benchmarks like MMBench by achieving 0.94–0.97 Spearman correlation with live human Elo ratings, versus 0.80 for prior automated evaluations.	WildVision (2024), VisionArena (2024), CapArena (2025)
Robust Anti-Shortcut Benchmark Design	Circular evaluation shuffles answer choices across multiple passes, while blind baselines verify models cannot solve questions without visual input, ensuring genuine multimodal understanding.	DatBench achieves 13x evaluation speedup and reveals ~35 point accuracy drop on AI2D when converting MCQ to generative format, correcting inflated capability estimates from prior benchmarks.	MMBench (2023), MathVerse (2024), DatBench (2026)
Hierarchical Cognitive Capability Profiling	Adapts established human cognitive frameworks (like Piaget's developmental stages or Gardner's Multiple Intelligences) to create hierarchical VLM diagnostics that isolate perception, attention, and reasoning failures.	Reveals that 56% of VLM failures stem from perception deficits, not reasoning, and that core cognitive abilities show no improvement with model scaling, unlike prior holistic benchmarks that masked these patterns.	Core Knowledge Deficits in Multi-Modal... (2024), Defining and Evaluating Visual Language... (2025), VRIQ (2026)
Domain-Adaptive Multi-Task Benchmark Suites	Integrates domain expert knowledge into benchmark construction with multi-granularity tasks spanning basic recognition to complex reasoning, using professional-grade data sources unavailable in web-scraped training sets.	Achieves +16% F1 zero-shot on findings classification over supervised training (Merlin on CT scans), while AgroBench reveals open-source models score only 30% on weed identification versus GPT-4o's 79%.	Merlin (2024), Spider2-V (2024), AgroBench (2025)
Process-Oriented Interactive Safety Evaluation	Triggers safety checks immediately before or after risk-prone actions in interactive environments, detecting intermediate unsafe behaviors that termination-based evaluation misses entirely.	Reveals that all SOTA agents achieve <40% Safe Success Rate on IS-Bench and that GPT-5 over-discloses location 47.6% of the time on VLM-GeoPRIVACY, failures invisible to prior binary Attack Success Rate metrics.	MM-SafetyBench (2023), IS-Bench (2025), VLM-GeoPRIVACY (2026)
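
The arena row in the table above turns pairwise preference votes into Elo rankings. A minimal sketch of the classic online Elo update is shown here to make that conversion concrete. The K-factor, initial rating, and battle format are illustrative assumptions, and live arenas may instead fit a Bradley-Terry model over all votes rather than updating sequentially.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, battles, k=32.0, initial=1000.0):
    """Apply sequential Elo updates.

    battles: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A is preferred, 0.0 if B is preferred, and 0.5 for a tie.
    Returns a new dict; the input `ratings` is not modified.
    """
    ratings = dict(ratings)
    for a, b, outcome in battles:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ea = expected_score(ra, rb)
        ratings[a] = ra + k * (outcome - ea)
        ratings[b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved. The result of sequential updates depends on vote order, which is one reason batch-fitted alternatives are often preferred for published leaderboards.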

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MMBench	Accuracy with CircularEval (all shuffled passes must be correct)	Top-tier models achieve ~85% accuracy	MMBench (2023)
Video-MME	Accuracy (multiple-choice)	81.3% with subtitles (Gemini 1.5 Pro)	Video-MME (2024)
MathVerse	Accuracy across Vision-Only to Text-Dominant problem versions	GPT-4V demonstrates the best visual comprehension but still drops significantly without text cues	MathVerse (2024)
VisionArena-Bench	Spearman correlation with live Arena Elo ratings	97.3% Spearman correlation with live leaderboard	VisionArena (2024)
Spider2-V	Task Success Rate	14.0% success rate (GPT-4V)	Spider2-V (2024)

⚠️ Known Limitations (4)

Data contamination and benchmark saturation: models may have seen test data during pretraining, inflating scores without genuine capability improvement. (affects: Robust Anti-Shortcut Benchmark Design, Arena-Based Human Preference Evaluation)
Potential fix: Multi-modal semantic perturbations can detect contamination without training data access; temporal separation and continuous benchmark renewal reduce leakage risk.
Evaluation cost and scalability: comprehensive benchmarks require expensive LLM-based judging, human preference collection, or interactive simulation environments that are slow and expensive to run. (affects: Arena-Based Human Preference Evaluation, Process-Oriented Interactive Safety Evaluation)
Potential fix: Data-centric subset selection achieves 13x speedup; automated judges with high human correlation reduce reliance on manual evaluation.
Cultural and linguistic bias: the vast majority of benchmarks are English-centric and Western-focused, systematically underestimating model failures on non-Western content. (affects: Domain-Adaptive Multi-Task Benchmark Suites, Hierarchical Cognitive Capability Profiling)
Potential fix: Culturally-sourced benchmarks with native annotators and multilingual parallel corpora enable fair cross-cultural evaluation.
Gap between benchmark performance and deployment readiness: models can pass academic evaluations while failing catastrophically in interactive, safety-critical, or time-pressured real-world scenarios. (affects: Robust Anti-Shortcut Benchmark Design, Process-Oriented Interactive Safety Evaluation)
Potential fix: Process-oriented interactive evaluation with dynamic risk generation in simulators; reliability-focused benchmarks with corruption and text-only baselines to detect blind reasoning.
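
The text-only baseline mentioned above can be sketched as a simple filter: ask a model each question without the image and drop the items it still answers correctly, since those do not require vision. The item schema, the `text_only_answer` callable, and the `trials` parameter below are assumptions made for illustration rather than any specific benchmark's pipeline.

```python
from typing import Callable, Dict, List

def filter_blind_solvable(items: List[Dict],
                          text_only_answer: Callable[[str, List[str]], str],
                          trials: int = 3) -> List[Dict]:
    """Keep items that a text-only model fails at least once in `trials` tries.

    Each item is assumed to have "question", "choices", and "answer" keys.
    Items answered correctly on every blind trial are treated as solvable
    without visual input and removed.
    """
    kept = []
    for item in items:
        hits = sum(
            text_only_answer(item["question"], item["choices"]) == item["answer"]
            for _ in range(trials)
        )
        if hits < trials:
            kept.append(item)
    return kept
```

Repeating the blind query several times is a hedge against lucky guesses on multiple-choice items; a stricter variant could drop any item whose blind accuracy exceeds chance.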

📚 View major papers in this topic (10)

Visual Instruction Tuning (2023-04) 9
MMBench: Is Your Multi-modal Model an All-around Player? (2023-07) 8
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (2024-06) 9
VisionArena: 230K Real World User-VLM Conversations with Preference Labels (2024-12) 9
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2024-03) 9
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (2024-07) 9
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations (2026-01) 9
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks (2025-06) 9
SPINBENCH: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs (2025-09) 9

💡 Another cross-cutting theme examines Application.

🏆

Application

What: Research on deploying multimodal AI models—combining vision, language, and action—to solve real-world tasks across robotics, healthcare, driving, and specialized domains.

Why: Bridging the gap between general-purpose multimodal models and the domain-specific reliability, efficiency, and grounding required for practical deployment.

Baseline: General-purpose vision-language models applied zero-shot or with minimal adaptation to domain-specific tasks, often producing hallucinations and lacking actionable outputs.

Domain gap: general VLMs lack specialized knowledge for fields like medicine, agriculture, and telecommunications
Deployment efficiency: large models are too slow and resource-heavy for real-time edge applications like robotics and driving
Evaluation realism: existing benchmarks use clean, curated data that masks failures on noisy, multilingual, real-world inputs

🧪 Running Example

❓ A hospital radiologist asks an AI system: 'Analyze this chest X-ray, identify any abnormalities, and draft a preliminary report with findings.'

Baseline: A general-purpose VLM like GPT-4V would describe the image at a high level ('a chest X-ray showing lungs') but miss subtle pathological cues like small nodules, hallucinate non-existent findings, and produce vague reports lacking clinical terminology, failing the radiologist's need for precise, actionable diagnostic support.

Challenge: This example illustrates three key challenges: (1) the perception gap—general visual encoders miss fine-grained lesions, (2) the reasoning gap—language priors override weak visual signals, causing hallucinations, and (3) the deployment gap—cloud-based models introduce latency, cost, and privacy concerns incompatible with clinical workflows.

✅ Domain-Adaptive VLM Post-Training: AdaMLLM generates synthetic medical instruction data from open-source models, then fine-tunes the VLM in a single stage to learn radiology-specific terminology and visual patterns, improving VQA-RAD accuracy by +4.6% over prior medical VLMs.

✅ Agentic Multi-Step Visual Processing: Meissa distills frontier model behavior into a compact 4B-parameter agent that runs on-premise, matching GPT-4o in 10 of 16 medical benchmarks while reducing latency by 22x and eliminating cloud API privacy risks.

✅ R1-Style RL for Visual Reasoning: RARL uses GRPO reinforcement learning with reasoning-aware rewards to teach the model explicit diagnostic chain-of-thought steps, improving generalization to unseen clinical datasets by ~27% over supervised fine-tuning.

✅ Scale-then-Compress Efficient VLM Architecture: NVILA first scales up image resolution to capture fine radiological details, then compresses visual tokens to enable real-time inference—reducing prefilling latency by 1.6–2.2x while maintaining diagnostic accuracy.
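
To illustrate the compress half of the scale-then-compress idea in the last item above, the sketch below average-pools a grid of visual token embeddings so that a higher-resolution input yields fewer tokens for the language model. The 2x2 window and the plain-list representation are assumptions chosen for clarity; NVILA's actual compression operator may differ.

```python
def pool_tokens(grid, window=2):
    """Average-pool an H x W grid of embedding vectors.

    grid: list of rows, each row a list of equal-length float vectors.
    Returns a grid of size (H // window) x (W // window). Trailing rows or
    columns that do not fill a complete window are dropped.
    """
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, h - h % window, window):
        row = []
        for c in range(0, w - w % window, window):
            cell = [grid[r + dr][c + dc]
                    for dr in range(window) for dc in range(window)]
            row.append([sum(v[i] for v in cell) / len(cell) for i in range(dim)])
        pooled.append(row)
    return pooled
```

With a 2x2 window the token count drops by a factor of four, so doubling the image side length before pooling leaves the number of tokens passed to the language model roughly unchanged.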

📈 Overall Progress

The field evolved from exploratory demonstrations of multimodal capabilities (GPT-4V, 2023) through rigorous domain-specific adaptation and evaluation (2024–2025) to production-ready agentic systems and lightweight models matching frontier performance (2025–2026). A key paradigm shift occurred when reinforcement learning—particularly GRPO with rule-based rewards—was applied to VLMs, enabling dramatic reasoning improvements without learned reward models. Simultaneously, real-world benchmarks consistently exposed that even the best models achieve only 50–60% accuracy under realistic conditions, driving development of specialized, efficient deployment solutions.
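
The paragraph above credits GRPO with rule-based rewards for enabling reasoning gains without a learned reward model. A minimal sketch of the two ingredients follows: a rule that scores a response by checking its final answer, and the group-relative normalization that turns a group's rewards into advantages. The `<answer>` tag format and the use of population standard deviation are assumptions; individual implementations vary.

```python
import re
from statistics import mean, pstdev

def rule_based_reward(response: str, gold: str) -> float:
    """Hypothetical rule: 1.0 if the text inside <answer>...</answer> equals gold."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to the same prompt, scored by the rule above.
responses = [
    "<think>count the nodules</think><answer>2</answer>",
    "<answer>3</answer>",
    "no tagged answer",
    "<answer> 2 </answer>",
]
rewards = [rule_based_reward(r, "2") for r in responses]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples for the same prompt, no separate value network or learned reward model is needed. A group in which every response earns the same reward yields zero advantages and contributes no gradient.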

📂 Sub-topics

Robotics & Embodied AI

14 papers

Vision-Language-Action (VLA) models combined with reinforcement learning for robotic manipulation, autonomous drone flight, and deployment-time reliability in unstructured real-world environments.

Vision-Language-Action-Critic (VLAC), World Model-based Policy Optimization (WMPO), RL-100 Unified Framework

Healthcare & Biomedical AI

15 papers

Adapting multimodal models for clinical applications including radiology report generation, dermatology diagnosis, ophthalmology, and lightweight medical agentic systems that comply with privacy constraints.

Reasoning-Aware RL (RARL), Deep Expert Injection, Meissa Agentic Intelligence

Autonomous Driving & Transportation

12 papers

VLM-based perception, reasoning, and planning for autonomous vehicles, including fine-grained evaluation benchmarks, long-tail data curation, and visual chain-of-thought for driving theory.

Hierarchical Fine-Grained Evaluation (VLADBench), Neuro-Symbolic Data Mining (Semantic-Drive), Retrieval-Based Visual CoT

Domain-Specific VLM Adaptation

35 papers

Methods for adapting general-purpose VLMs to specialized domains including agriculture, materials science, telecommunications, chart understanding, scientific visualization, and wildlife conservation.

Domain-Adaptive Post-Training (AdaMLLM), Structure-Aware Multimodal Bootstrapping, Iterative Self-Training

Benchmarks & Real-World Evaluation

25 papers

New evaluation paradigms testing VLMs on high-resolution, noisy, multilingual, and domain-specific real-world scenarios that consistently expose 40–50% accuracy gaps between current models and human performance.

High-Resolution Human-Annotated Benchmarking, Multi-Difficulty Noise Evaluation, Egocentric Wearable Evaluation

Efficient Models & Deployment

18 papers

Techniques for making multimodal models practical—including quantization, compact VLMs, parameter-efficient fine-tuning, and unified training infrastructure—to enable edge deployment and reduce costs.

Scale-then-Compress Architecture (NVILA), Compact VLM via Small LM Integration, One-Stop Training Infrastructure (SWIFT)

Agentic Systems & Visual Reasoning

25 papers

Multi-step agent pipelines, tool-use benchmarks, RL-enhanced visual reasoning, and agentic frameworks that orchestrate multiple models to solve complex real-world tasks with self-correction and reflection.

R1-style Visual RL (VLM-R1), Agentic Quality-Driven SR (4KAgent), Generative Engine Optimization

Multimodal Infrastructure & Signal Processing

18 papers

Hardware designs for mm-wave 5G communications, multimodal sensor fusion, image denoising, joint source-channel coding, and foundational infrastructure enabling multimodal AI deployment at scale.

Wideband RIS Design, Multi-Modal Sensor Fusion, Digital-to-Analog JSCC

💡 Key Insights

💡 Real-world benchmarks expose 50–60% accuracy ceiling for even the best multimodal models

💡 Rule-based RL (GRPO) enables small VLMs to surpass models 10–30x their size

💡 Domain-adaptive post-training with open-source data outperforms GPT-4-based methods

💡 Agentic multi-step pipelines match frontier model performance at 20–90x lower cost

💡 Visual noise and non-English languages cause 35%+ performance degradation in VLMs

📅 Timeline

Research has shifted from 'can VLMs do X?' to 'how reliably, efficiently, and safely can VLMs do X in the real world?'—driving three converging trends: domain-specific adaptation with open-source data, real-world evaluation rigor, and agentic multi-step orchestration with lightweight on-premise models.

2023-02 to 2024-06 Foundation exploration and early domain applications

Comprehensive GPT-4V exploration (The Dawn of LMMs, 2023) systematically documented LMM capabilities across domains including medical imaging, celebrity recognition, and abstract visual reasoning
CLIP2 (Contrastive Language-Image-Point Pretraining, 2023) bridged 2D vision-language models to 3D point cloud understanding with +253% improvement on outdoor recognition
SQNR-based mixed precision quantization (Practical Mixed Precision Algorithm, 2023) recovered BERT accuracy from 74.13% to 82.97% via label-free layer sensitivity analysis
(Efficient Multi-Modal Assistant, 2024) proved 2.7B-parameter models could compete with 7B+ VLMs, opening the path to edge deployment
STAR benchmark (Situated Reasoning in Real-World Videos, 2024) exposed a ~50% gap between machine and human situated reasoning ability

🔀 GPT-4V demonstrated that large multimodal models could perform human-level reasoning across diverse visual domains, catalyzing an explosion of application-oriented research.

2024-07 to 2025-06 Domain specialization and real-world evaluation rigor

MME-RealWorld (Benchmark for MLLM in the..., 2025) revealed that even GPT-4o fails to surpass 60% accuracy on real-world high-resolution tasks with human-annotated ground truth
NVILA (Efficient Visual Language Models, 2024) introduced scale-then-compress architecture achieving +30% accuracy while cutting training costs by up to 5.1x
Dream to Fly (Model-Based Reinforcement Learning for Vision-Based Drone Flight, 2025) achieved the first end-to-end pixel-to-command autonomous drone flight via learned world models with 100% simulation success
VLM-R1 (R1-style Visual RL, 2025) pioneered applying GRPO with rule-based rewards to visual tasks, enabling 3B models to surpass 7B baselines
AdaMLLM (On Domain-Adaptive Post-Training for Multimodal..., 2024) established a systematic open-source domain adaptation pipeline outperforming GPT-4-based medical VLMs

🔀 Research shifted from proving VLMs could do tasks to rigorously evaluating where they fail in the real world, spawning a wave of specialized benchmarks and domain-adaptive methods.

2025-07 to 2026-03 Agentic deployment and industrial-scale real-world applications

RL-100 (Real-World, 2025) achieved 100% success rate across 1000 evaluations including continuous 7-hour zero-failure operation in a public shopping mall
Meissa (Medical Agentic Intelligence, 2026) demonstrated a 4B-parameter agent matching GPT-4o across medical benchmarks at 22x lower latency
4KAgent (Agentic Any Image to 4K Super-Resolution, 2025) introduced a perception-restoration agent pipeline setting new SOTA across 11 task categories including medical and satellite imaging
MirageTVQA (Failure Modes of VLMs on Real-World Tables, 2025) exposed a 35% performance drop from visual noise and a severe English-first bias across 24 languages
Generative Engine Optimization (Pinterest GEO, 2026) deployed VLM agents at production scale, achieving 20% traffic growth across billions of images at 94x lower cost

🔀 The field transitioned from single-model solutions to multi-agent orchestration and deployment-focused systems, with lightweight models matching frontier model capabilities at a fraction of the cost.

🔬 Key Methods

Method	Key Innovation	Improves On	Papers
Vision-Language-Action Reinforcement Learning	Unify imitation and reinforcement learning under VLA architectures, using world models or dense vision-language critics to enable safe, sample-efficient real-world robot learning.	Improves on Diffusion Policy (DP3) baseline by +32.2% mean success rate (100% vs 67.8%) across 8 real-world manipulation tasks, achieving continuous 7-hour zero-failure operation (RL-100).	RL-100 (2025), Dream to Fly (2025), A Vision-Language-Action-Critic Model for Robotic... (2025), WMPO (2025)
Domain-Adaptive VLM Post-Training	Generate domain-specific visual instruction data using open-source pipelines, then fine-tune VLMs with progressive curricula spanning captioning, VQA, and reinforcement learning.	Improves on LLaVA-Med (GPT-4 generated) by +4.6% on VQA-RAD using only open-source models (AdaMLLM); MatterChat outperforms GPT-4o on formation energy estimation for novel materials.	On Domain-Adaptive Post-Training for Multimodal... (2024), AgriGPT-VL (2025), MatterChat (2025), MM-Telco (2025)
R1-Style RL for Visual Reasoning	Use tasks with deterministic answers (bounding boxes, exact matches) as rule-based rewards in GRPO, enabling stable RL training that improves VLM reasoning and out-of-domain generalization.	Improves on Supervised Fine-Tuning by +8.34 points on LISA-Grounding (63.16 vs 54.82) with 3B model surpassing 7B baseline on OVDEval (31.01 vs 29.08); RARL achieves +27% on unseen medical datasets.	VLM-R1 (2025), RARL (2025), UAV-VL-R1 (2025), Are Video Reasoning Models Ready... (2026)
Agentic Multi-Step Visual Processing	Decompose complex visual tasks into perception, planning, and execution stages with reflection, rollback, and quality-driven expert routing for robust, interpretable processing.	Meissa (4B parameters) matches GPT-4o in 10/16 medical settings with 25x fewer parameters and 22x lower latency; 4KAgent sets new state-of-the-art on RealSR benchmarks across 11 task categories.	Meissa (2026), 4KAgent: Agentic Any Image to... (2025), Generative Engine Optimization (2026), GTA (2024)
Scale-then-Compress Efficient VLM Architecture	Increase input resolution and frame count for maximum information capture, then apply spatial-to-channel reshaping, token pruning, or model distillation to compress representations for efficient processing.	Improves on VILA baseline by +30% accuracy on text-heavy benchmarks while reducing training costs by 1.9–5.1x and prefilling latency by 1.6–2.2x (NVILA); LLaVA-Phi (3B) outperforms 7B+ models on ScienceQA with 71.4% accuracy.	NVILA (2024), LLaVA-Phi (2024), SWIFT (2024)
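
The R1-style RL row above pairs deterministic, rule-based rewards with GRPO. The sketch below illustrates that combination under stated assumptions: an IoU-threshold reward for bounding boxes, and the commonly described group-relative normalization (reward minus group mean, divided by group standard deviation). The function names, the 0.5 threshold, and the sample boxes are illustrative and are not taken from VLM-R1 or any other paper listed here.

```python
from statistics import mean, pstdev

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_box, gt_box, threshold=0.5):
    """Deterministic reward: 1.0 if the boxes overlap enough, else 0.0 (threshold is an assumption)."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled box predictions (made-up values).
gt = (10, 10, 50, 50)
samples = [(12, 11, 49, 52), (30, 30, 90, 90), (10, 10, 50, 50), (0, 0, 20, 20)]
rewards = [rule_based_reward(s, gt) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by a rule rather than a learned model, samples that match the ground truth receive positive advantages and the rest receive negative ones, with no reward model to train or exploit.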

📊 Benchmark Results

Benchmark	Metric	Best Result	Paper
MME-RealWorld	Accuracy (%)	<60% (GPT-4o / Gemini 1.5 Pro)	MME-RealWorld (2025)
MirageTVQA (Noisy Multilingual Tables)	Exact Match (%)	25.52% EM clean / 16.50% EM noisy (Qwen2.5-VL-72B)	Lost in Translation and Noise:... (2025)
Real-World Robotic Manipulation (RL-100)	Success Rate (%)	100% success rate across all tasks	RL-100 (2025)
LISA-Grounding (Out-of-Domain Visual Grounding)	Grounding Score	63.16 (Qwen2.5-VL-3B + VLM-R1)	VLM-R1 (2025)

⚠️ Known Limitations (4)

Domain gap persistence: even after adaptation, VLMs hallucinate domain-specific details (e.g., medical findings, rare species) because pre-trained visual encoders lack fine-grained domain features like microaneurysms or subtle crop diseases (affects: Domain-Adaptive VLM Post-Training, R1-Style RL for Visual Reasoning)
Potential fix: Dual-stream encoding with specialized domain encoders fused via learned gates, as demonstrated by Deep Expert Injection achieving +12.55% precision improvement over simple addition
Robustness to real-world degradation: models trained on clean data suffer catastrophic performance drops (35%+) when facing noisy, low-light, blurred, or compressed inputs typical of deployment environments (affects: Scale-then-Compress Efficient VLM Architecture, Domain-Adaptive VLM Post-Training)
Potential fix: ROVA-style robustness training with structured spatio-temporal corruptions and consistency rewards between clean and perturbed branches, boosting perturbed accuracy by 24%+
Evaluation-deployment mismatch: benchmarks using clean data, multiple-choice formats, and English-only content overestimate real-world capabilities, especially for safety-critical domains like autonomous driving and medicine (affects: Vision-Language-Action Reinforcement Learning, Agentic Multi-Step Visual Processing)
Potential fix: Hierarchical fine-grained benchmarks (like VLADBench with 29 tertiary tasks) combined with closed-loop real-world testing and deployment-time monitoring frameworks that detect distribution shift
Sample efficiency and safety in real-world RL: real-robot interactions are expensive and risky, and learned world models may not faithfully capture edge cases in unstructured environments (affects: Vision-Language-Action Reinforcement Learning)
Potential fix: World model-based imagination (WMPO) for safe off-robot policy learning, combined with runtime monitoring hierarchies and feasibility-aware task planning that maximizes joint success probability

📚 View major papers in this topic (10)

RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning (2025-10) 9
Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight (2025-01) 9
NVILA: Efficient Visual Language Models from Pre-training to Deployment (2024-12) 9
Meissa: Multi-modal Medical Agentic Intelligence (2026-03) 9
MME-RealWorld: A Benchmark for MLLM in the Real World (2025-02) 9
Lost in Translation and Noise: A Deep Dive into Failure Modes of VLMs on Real-World Tables (2025-11) 9
4KAgent: Agentic Any Image to 4K Super-Resolution (2025-07) 9
Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth (2026-02) 9
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023-09) 9
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (2025-04) 8

💡 Another cross-cutting theme covers surveys and comprehensive benchmarks.

📱

Survey

MM-LLMs: Recent Advances in MultiModal Large Language Models (2024-01) 9
Tutorial on Diffusion Models for Imaging and Vision (2024-03) 9
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey (2024-07) 9
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety (2025-02) 9
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers (2025-06) 9
Explain Before You Answer: A Survey on Compositional Visual Reasoning (2025-08) 9
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents (2025-08) 9
CRAG-MM: A Comprehensive Benchmark for Multi-modal Multi-turn Retrieval-Augmented Generation (2025-10) 9
MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning (2026-01) 9

🎯 Practical Recommendations

Priority	Recommendation	Evidence
High	Adopt reinforcement learning with verifiable rewards (GRPO) as the default post-training paradigm for multimodal models, as it consistently outperforms supervised fine-tuning by 10-30% across understanding, generation, and robotics tasks with minimal human annotation	R1-Zero showed a 2B GRPO-trained model outperforms SFT-tuned 72B models; SimpleVLA-RL achieves 91.7% from a single demo; Flow-GRPO boosts GenEval from 63% to 95%
High	Integrate grounded chain-of-thought reasoning into multimodal systems by requiring models to output bounding box coordinates alongside text reasoning, which reduces hallucination rates by 30-55% and improves answer-grounding consistency	GCoT revealed that even 72B models achieve only 11.1% grounding consistency despite 75.7% accuracy; grounded CoT improves consistency by +55.7%
High	Use dynamic visual token compression (pruning 50-76% of tokens) combined with resolution routing to achieve 4-10x inference speedup for production multimodal deployments with minimal accuracy loss	METEOR prunes 76% of tokens with only 0.3% accuracy drop; InternVL3.5 achieves 4.05x speedup scoring 77.7 on MMMU; AIM reduces FLOPs by 6.8x
High	Deploy compact sub-1B specialized models for document parsing and OCR tasks, as they match or exceed 100x-larger general-purpose VLMs on structured document understanding	GLM-OCR (0.9B) ranked first on OmniDocBench v1.5 outperforming GPT-5.2; olmOCR processes PDFs at 35x lower cost than GPT-4o
Medium	Use process reward models for step-level supervision during multimodal reasoning, as they improve performance across model scales by +5.9 points and enable effective test-time compute scaling via best-of-N selection	VisualPRM-8B improves even 78B models by +5.9 points; DreamPRM reaches 85.2% on MathVista; Athena-PRM achieves 83.1 F1 at 1/45th GPU cost
Medium	Adopt dual-system hierarchical architectures (fast + slow) for embodied AI applications requiring both high-level reasoning and real-time control, achieving 100+ Hz reactive control alongside VLM-quality planning	Fast-in-Slow achieves 117.7 Hz control and +11% over OpenVLA; OneTwoVLA achieves +30% on long-horizon tasks with autonomous mode switching
Medium	Evaluate multimodal models on process-level reasoning quality rather than just final answer accuracy, since correct answers frequently mask severe intermediate hallucinations — top models show 70.6% accuracy but only 22.8% thinking correctness	MM-THEBench revealed 70.6% answer accuracy but only 22.8% thinking correctness; GCoT found inverse scaling where larger models ground worse
Low	Leverage world models for safe robotic policy training in imagination before real-world deployment, reducing data requirements by 3-4x while enabling evaluation of candidate trajectories against predicted collisions	DINO-WM enables zero-shot planning with +45% success rate; PlayWorld improves real-world policy success by 65%; Kinematics-aware models achieve +23.1% mean return

🔑 Key Takeaways

🧠

RL Replaces Supervised Learning

Group Relative Policy Optimization (GRPO) has become the universal post-training paradigm, simultaneously transforming visual reasoning, image generation, video understanding, and robotic control. Small 2-7B models trained with GRPO consistently outperform 72B+ supervised counterparts, proving that RL teaches transferable reasoning while SFT merely memorizes patterns.

GRPO-trained small models outperform supervised ones 10x their size across these domains.

👁️

Perception Is the Real Bottleneck

Across visual reasoning, video understanding, and robotics, 72-78% of errors stem from incorrect visual perception rather than flawed logic. Larger models paradoxically ground worse — 72B models show only 11.1% answer-grounding consistency despite 75.7% accuracy. This reveals that scaling alone cannot solve visual understanding.

Visual perception, not reasoning logic, causes most multimodal failures.

🎯

Grounded Reasoning Halves Hallucinations

Requiring models to produce bounding box coordinates and spatial evidence alongside text reasoning reduces hallucination rates by 30-55% and exposes the gap between correct answers and correct reasoning. Training-free attention interventions like OPERA can achieve +35.8% improvement without any model retraining.

Spatial grounding in reasoning chains cuts hallucinations in half.

🤖

Robots Achieve Near-Perfect Manipulation

RL post-training has broken the imitation learning ceiling for robotic manipulation, with systems achieving 99-100% success rates on real-world tasks and operating continuously for 7 hours in public environments. Dense process reward models enable one-shot adaptation from near-zero to 95% success with only 150 rollouts.

RL-enhanced robots achieve 100% real-world success rates continuously.

⚡

Compact Models Beat Giants

Across document parsing, GUI grounding, image generation, and robotic control, specialized compact models consistently outperform models 10-100x their size. A 0.9B document parser beats GPT-5.2, a 7B GUI agent outperforms 72B UI-TARS, and a 2.6B one-step generator surpasses 12B FLUX-dev.

Specialized compact models consistently outperform generalists 10-100x their size.

🎬

Generation Meets Understanding

The boundary between understanding and generation is dissolving. Unified models like MMaDA and Mogao jointly reason and generate, while chain-of-thought before generation improves compositional accuracy by 89-160%. Reasoning before acting has become the dominant paradigm from image creation to robotic control.

Reasoning before generating boosts quality 89-160% across modalities.

🚀 Emerging Trends

Test-time compute scaling allows small models to match much larger ones by spending additional computation at inference through evolutionary search, tree-based exploration, or process reward model-guided best-of-N selection

EvoSearch with Wan 1.3B matches the 10x larger Wan 14B model; VisVM-guided captions are preferred 74% over greedy decoding; DreamPRM reaches 85.2% on MathVista via best-of-N with o4-mini

Self-improving agentic systems that autonomously collect data, refine their own outputs, and learn from experience without human intervention are emerging across generation, editing, and robotics

SIDiffAgent uses Theory-of-Mind inspired self-improvement with +8.73% on GenAIBench; PlayWorld learns from autonomous robot play improving success by 65%; PLD's residual RL achieves self-improvement to 99% success

Unified understanding-generation models that jointly perceive and create across modalities are replacing separate specialized systems, with explicit reasoning bridging the cognitive gap between comprehension and synthesis

MMaDA surpasses autoregressive LLMs on reasoning while excelling at generation; ImageGen-CoT improves compositional accuracy by 89-160% via structured reasoning before generation; Mogao achieves 83.3% MME while enabling interleaved multi-modal generation

Physical world simulation via learned world models is enabling robots and autonomous vehicles to train entirely in imagination before real-world deployment, with explicit kinematics grounding and causal reasoning

DINO-WM enables zero-shot planning with frozen foundation model features; Kinematics-aware latent models reduce data needs by 4x; IRL-VLA eliminates sensor simulation via reward world models

Multi-scene narrative video generation and audio-visual joint synthesis are extending video generation from short single clips to minutes-long coherent storytelling with synchronized audio

Long Context Tuning generates coherent 20-shot 3-minute videos; Seedance 1.5 Pro achieves native joint audio-visual generation; COMIC produces fully automated comedy videos via multi-agent collaboration

🔭 Research Opportunities

Bridging the massive human-AI gap on abstract visual logic — top models achieve only 31.1% on VisuLogic versus 51.4% for humans, and near-random on tasks requiring genuine spatial reasoning beyond language shortcuts

Despite the RL revolution, fundamental visual perception and abstract reasoning remain dramatically weaker than human cognition. This gap limits deployment in safety-critical applications requiring reliable visual understanding.

Difficulty: High Impact: High

Developing robust multilingual and culturally-aware multimodal models — current systems show up to 30+ percentage point performance drops on non-Western cultural concepts and low-resource languages

The field is overwhelmingly English-centric with Western cultural bias embedded in training data, evaluation benchmarks, and model design, severely limiting global applicability.

Difficulty: Medium Impact: High

Solving physical plausibility in video generation — models produce visually stunning but physically impossible videos that violate gravity, object permanence, and fluid dynamics, limiting use in simulation and robotics

High visual fidelity scores mask fundamental physics violations. As video generation moves into world modeling for autonomous driving and robotics, physical accuracy becomes safety-critical.

Difficulty: High Impact: High

Creating unified cross-domain evaluation standards that test process-level reasoning quality rather than just final answer accuracy, exposing models that achieve correct answers through hallucinated reasoning

Current benchmarks are fragmented across incompatible metrics, and models achieving 70.6% accuracy show only 22.8% thinking correctness. Standard MCQ formats allow text-based elimination without genuine visual understanding.

Difficulty: Medium Impact: High

Scaling RL training efficiency for visual models — full trajectory sampling with large group sizes makes GRPO prohibitively expensive, limiting its accessibility to well-resourced labs

Despite GRPO's effectiveness, the computational cost of generating multiple candidate outputs per prompt creates significant barriers. Single-rollout and tree-structured approaches show promise but need further development.

Difficulty: Medium Impact: Medium

Long-horizon interleaved multi-modal generation that maintains quality beyond 20 visual events — current unified models collapse after approximately 20 generated images regardless of text token count

The event bottleneck phenomenon limits practical applications like storybook generation, long-form document creation, and multi-turn visual dialogue to very short sequences.

Difficulty: High Impact: Medium

🏆 Benchmark Leaderboard

MMMU (Expert-Level Multimodal Understanding)

Expert-level multimodal reasoning across 30 college subjects requiring domain knowledge and deliberate reasoning over diverse image types (Metric: Accuracy (%))

Rank	Method	Score	Paper	Year
🥇	InternVL3.5 with Visual Resolution Router and Cascade RL	77.7% — +22.0% over GPT-4V (55.7%), still trailing human 88.6%	InternVL3.5 (2025)	2025
🥈	Kimi-VL with native-resolution MoE decoder	64.0% — +8.3% over GPT-4V (55.7%)	Kimi-VL (2025)	2025

MathVista (Visual Mathematical Reasoning)

Multimodal mathematical reasoning combining visual perception with problem-solving across geometry, statistics, and scientific figures (Metric: Accuracy (%))

Rank	Method	Score	Paper	Year
🥇	DreamPRM with o4-mini via best-of-N selection	85.2% — +35.3% over original GPT-4V baseline (49.9%)	DreamPRM (2025)	2025
🥈	Kimi k1.5 with long-context RL	74.9% — +25.0% over GPT-4V baseline	Kimi k1.5 (2025)	2025

GenEval (Compositional Text-to-Image Generation)

Compositional accuracy in text-to-image generation across object counting, attribute binding, and spatial relationships (Metric: Overall Accuracy (%))

Rank	Method	Score	Paper	Year
🥇	DiffusionNFT with forward-process RL	98%; +35 points absolute (+55.6% relative) over base SD3.5-M (63%)	DiffusionNFT (2025)	2025
🥈	Flow-GRPO with ODE-to-SDE conversion	95%; +32 points absolute over base SD3.5-M (63%)	Flow-GRPO (2025)	2025

LIBERO (Multi-Task Robotic Manipulation)

Multi-task robotic manipulation generalization across objects, scenes, and long-horizon task sequences in simulation (Metric: Success Rate (%))

Rank	Method	Score	Paper	Year
🥇	Probe-Learn-Distill with residual RL agents	99.0% — +22.5% over standard OpenVLA (76.5%)	Self-Improving (2025)	2025
🥈	VLA-Thinker with dynamic visual perception actions	97.5% — +6.5% over OpenVLA-OFT (91.0%)	VLA-Thinker (2026)	2026

📊 Topic Distribution

Topic	Count	Share
Visual Question Answering	194	6.8%
Visual Grounding	249	8.7%
Image Captioning	40	1.4%
Document Chart Understanding	57	2.0%
Text To Image	238	8.3%
Text To Video	55	1.9%
Image Editing	27	0.9%
Unified Generation	11	0.4%
Video QA Captioning	61	2.1%
Temporal Reasoning	25	0.9%
Visual Reasoning	85	3.0%
Hallucination Mitigation	36	1.3%
Multimodal Alignment	8	0.3%
Visual Encoders	49	1.7%
Token Efficiency	62	2.2%
Multimodal Pretraining	45	1.6%
Robotic Manipulation	63	2.2%
Autonomous Driving	31	1.1%
World Models	11	0.4%
Vision Language Understanding	751	26.3%
Multimodal Generation	30	1.1%
Video Understanding	6	0.2%
Multimodal Reasoning	27	0.9%
Architecture And Efficiency	167	5.9%
Embodied And Robotics	73	2.6%
Other	623	21.9%
GUI Agents	39	1.4%
Remote Sensing	53	1.9%
Audio Speech	118	4.1%
Medical Multimodal	178	6.2%
Safety Robustness	247	8.7%
Analysis	657	23.0%
Benchmark	510	17.9%
Application	162	5.7%
Survey	131	4.6%

📚 Glossary of Terms (429 terms)

3D Gaussian Splatting

A real-time 3D rendering technique that represents scenes as collections of 3D Gaussian primitives, enabling fast novel-view synthesis and scene reconstruction.

6DoF (Six Degrees of Freedom)

The six independent parameters describing an object's position (x, y, z) and orientation (roll, pitch, yaw) in 3D space.

a11y (Accessibility) Tree

A hierarchical representation of UI elements with semantic labels (role, name, state) used by screen readers and automation tools to interact with applications.

Activation Patching

A mechanistic interpretability technique that replaces (patches) internal activations from one input with those from another to measure the causal effect of specific model components.

Activation Steering

A technique that modifies a model's behavior by injecting learned vectors into specific attention heads during inference, enabling targeted control without full retraining.

AdaIN (Adaptive Instance Normalization)

A normalization technique that transfers the statistical properties (mean and variance) of style features onto content features, enabling style control.
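
As a rough illustration of the definition above, the sketch below applies AdaIN to a single feature channel represented as a plain list. The function name, the `eps` stabilizer, and the sample values are assumptions made for this example; real implementations operate per channel on tensors.

```python
from statistics import mean, pstdev

def adain(content, style, eps=1e-5):
    """Re-normalize one content channel so it takes on the style channel's mean and std."""
    c_mu, c_sigma = mean(content), pstdev(content)
    s_mu, s_sigma = mean(style), pstdev(style)
    return [s_sigma * (x - c_mu) / (c_sigma + eps) + s_mu for x in content]

content_channel = [1.0, 2.0, 3.0, 4.0]      # made-up content activations
style_channel = [10.0, 20.0, 30.0, 40.0]    # made-up style activations
stylized = adain(content_channel, style_channel)
```

The output keeps the relative ordering of the content activations but carries the style channel's statistics, which is what makes the style controllable.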

Adapter

A lightweight neural network module inserted into a frozen pre-trained model to learn task-specific representations. Typically consists of a down-projection, nonlinearity, and up-projection with a residual connection.

Adversarial Perturbation

A carefully crafted, often imperceptible modification to an input (image, text, or audio) designed to cause a model to produce incorrect or harmful outputs.

AIGC (AI-Generated Content)

Digital content — including images, text, music, video, and 3D assets — created by artificial intelligence systems rather than humans.

AitW (Android-in-the-Wild)

A large-scale benchmark of real Android device control tasks collected from diverse real-world usage scenarios, testing generalization across apps and UI layouts.

Allocentric Reasoning

Spatial reasoning from a global, viewpoint-independent perspective (like a map), as opposed to egocentric reasoning from a first-person viewpoint.

ALM (Audio-Language Model)

A multimodal model that takes interleaved audio and text as input and generates text responses, enabling understanding of speech, sounds, and music alongside language.

Alt-text

Alternative text descriptions for images on webpages, designed to convey visual content to blind and low-vision users through screen readers.

AMBER Score

A composite evaluation metric combining generative hallucination rates (via CHAIR) and discriminative performance (via F1) into a single score for MLLM hallucination assessment.

AndroidWorld

An interactive benchmark for mobile GUI agents that tests task completion in a live Android emulator environment with real applications.

Anomaly Detection (AD)

The identification of patterns that deviate significantly from expected behavior, applicable to both sensory (visual defects) and semantic (novelty) domains.

Answer-Grounding Consistency

A metric that penalizes models producing correct answers without correctly identifying the relevant visual regions, measuring whether reasoning is genuinely grounded.

ASR (Attack Success Rate)

The percentage of adversarial inputs that successfully bypass a model's safety filters to elicit harmful responses.

ASR (Automatic Speech Recognition)

Technology that converts spoken language in audio into text, used in video understanding to extract dialogue and narration information.

AUROC (Area Under the Receiver Operating Characteristic)

A metric measuring a classifier's ability to distinguish between classes across all decision thresholds, where 1.0 indicates perfect discrimination and 0.5 is random chance.
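
One way to make the definition concrete: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pair counting, which is simple but quadratic; the function name and inputs are illustrative.

```python
def auroc(labels, scores):
    """AUROC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example, three of the four positive-negative pairs are ranked correctly, so the value is 0.75.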

Autoregressive (AR) Generation

A generation approach where tokens are produced one at a time in sequence, with each token conditioned on all previously generated tokens, commonly used in language models like GPT.

AVLN (Aerial Vision-Language Navigation)

The task of guiding a drone to a target location based on free-form natural language instructions and visual observations from the drone's camera.

AWR (Advantage-Weighted Regression)

An offline RL algorithm that weights actions in the training data by their estimated advantage, selectively reinforcing high-quality behaviors without requiring online environment interaction.

Backdoor Attack

A training-time attack where malicious patterns (triggers) are embedded in a model so it behaves normally on clean inputs but produces attacker-specified outputs when the trigger is present.

BCO (Binary Classifier Optimization)

An optimization objective that treats preference learning as a binary classification task evaluating absolute quality of individual responses rather than relative preference between pairs.

Beam Search

A decoding strategy that maintains multiple candidate sequences (beams) at each generation step, selecting the most probable continuations to produce higher-quality outputs.
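
A minimal sketch of the procedure, assuming a hypothetical `next_log_probs` callable that maps a token prefix to a dictionary of next-token log-probabilities. The toy model is invented so that greedy decoding and beam search disagree.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + (token,), score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

# Hypothetical toy model: "a" looks best at step one, but "b" leads to a better total.
def toy_model(prefix):
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
        ("b",): {"z": math.log(0.9), "y": math.log(0.1)},
    }
    return table.get(prefix, {"<eos>": 0.0})

best_tokens, best_score = beam_search(toy_model, beam_width=2, max_len=3)[0]
greedy_tokens, _ = beam_search(toy_model, beam_width=1, max_len=3)[0]
```

With a beam width of 1 the search reduces to greedy decoding and commits to "a"; with a width of 2 it keeps "b" alive long enough to find the higher-probability sequence.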

Behavior Cloning (BC), also Behavioral Cloning

An imitation learning approach that trains a policy by supervised learning on expert state-action pairs, mapping observations to actions. Its performance is limited by the quantity and quality of available demonstrations.

Best-of-N (BoN)

A test-time scaling strategy where N candidate solutions are sampled and scored by a reward model, selecting the highest-rated one as the final answer.
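
The selection step can be sketched in a few lines. Both callables here are hypothetical stand-ins: `noisy_solver` plays the role of a sampling model and `toy_reward` plays the role of a reward model, so the example shows only the control flow, not a real scoring setup.

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Sample n candidates and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: a noisy "solver" for 17 * 3 and a scorer that
# prefers answers close to 51.
def noisy_solver(rng):
    return 51 + rng.randint(-5, 5)

def toy_reward(answer):
    return -abs(answer - 51)

single = best_of_n(noisy_solver, toy_reward, n=1)
best_of_16 = best_of_n(noisy_solver, toy_reward, n=16)
```

With a fixed seed the 16-sample run includes the single-sample candidate, so its selected answer can only score the same or better; the quality of the result depends entirely on how well the scorer tracks correctness.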

BEV (Bird's-Eye View)

A top-down representation that projects a 3D scene or multi-sensor data onto a unified 2D overhead coordinate frame, commonly used in robotics and autonomous driving to provide explicit spatial and geometric understanding for reasoning and planning.

BLEU (Bilingual Evaluation Understudy)

An automatic evaluation metric for machine translation that measures the n-gram overlap between predicted and reference translations.
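
A simplified sketch of the computation, assuming one reference, whitespace-tokenized inputs, and no smoothing (so any n-gram order with zero matches yields a score of 0). It combines clipped n-gram precisions by geometric mean and applies the brevity penalty; production toolkits add tokenization rules and smoothing that are omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with one reference and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
score_bigram = bleu(candidate, reference, max_n=2)
```

The clipping step is what stops a candidate from scoring well by repeating one common word.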

Blind Baseline

An evaluation control where the model receives only text (no image) to determine how much performance relies on visual input versus textual shortcuts.

BPE (Byte Pair Encoding)

A tokenization method that iteratively merges the most frequent pairs of bytes or characters, used in language models and adapted for action tokenization in VLAs.
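
The merge loop can be sketched as follows, assuming character-level starting symbols and a toy word list; the function name and the corpus are illustrative. Applying the same idea to action tokenization would start from discretized action symbols instead of characters.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe(words, num_merges=2)
```

On this toy corpus the first two merges build the frequent suffix "est" from "e" + "s" and then "es" + "t".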

BraTS (Brain Tumor Segmentation Challenge)

A benchmark challenge for segmenting brain tumors from multi-modal MRI scans, evaluating methods on enhancing tumor, whole tumor, and tumor core sub-regions.

BSN (Blind-Spot Network)

A self-supervised denoising architecture that masks certain pixels during training to learn noise removal without clean reference images.

Budget Forcing

An inference-time technique that prevents a model from terminating reasoning prematurely by forcing continued generation until a pre-set computational budget (e.g., number of refinement rounds) is exhausted.

CALVIN

A benchmark for long-horizon language-conditioned manipulation requiring robots to sequentially complete multiple tasks, measuring generalization to unseen instructions and scenes.

CARLA

Car Learning to Act — an open-source urban driving simulator that supports development, training, and validation of autonomous driving systems.

Cascaded Group Attention (CGA)

An attention mechanism that splits input features into chunks, feeds different chunks to different attention heads, and cascades the output of one head to the next, reducing computational redundancy.

CASPO (Consequence-Aware Safety Policy Optimization)

A safety alignment method that trains models to project causal consequences of actions rather than merely detecting malicious intent, using dynamic self-distillation rewards.

Catastrophic Forgetting

The tendency of neural networks to lose previously learned knowledge when fine-tuned on new tasks, a central challenge in continual and multi-task learning.

CFG (Classifier-Free Guidance)

A sampling technique in diffusion models that extrapolates from the unconditional prediction toward the conditional one to improve generation quality and text alignment.
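
The combination rule is usually written as unconditional prediction plus a scale times the difference between the conditional and unconditional predictions. The sketch below applies it to plain lists standing in for model outputs; the function name and values are illustrative.

```python
def classifier_free_guidance(cond_pred, uncond_pred, guidance_scale):
    """Combine predictions as uncond + scale * (cond - uncond), element by element."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

cond = [1.0, 2.0]     # made-up prediction with the text prompt
uncond = [0.0, 1.0]   # made-up prediction with an empty prompt
guided = classifier_free_guidance(cond, uncond, guidance_scale=3.0)
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and scales above 1 push past the conditional prediction, which is the regime typically used to strengthen prompt adherence.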

Chain-of-Thought (CoT)

A prompting or training technique where models generate explicit intermediate reasoning steps before producing a final answer.

CHAIR (Caption Hallucination Assessment with Image Relevance)

A metric that measures the proportion of objects mentioned in generated captions that are not present in the ground-truth image annotations, quantifying object-level hallucination.
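
A small sketch of the two commonly reported variants: the per-mention rate (often written CHAIR_i) and the per-caption rate (CHAIR_s). It assumes object mentions have already been extracted from each caption and mapped to the annotation vocabulary, which is the hard part in practice and is not shown; all names and data are illustrative.

```python
def chair(mentioned_per_caption, ground_truth_per_image):
    """Return (CHAIR_i, CHAIR_s) given per-caption object mentions and ground-truth object sets."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(mentioned_per_caption, ground_truth_per_image):
        wrong = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        captions_with_hallucination += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned_per_caption)
    return chair_i, chair_s

mentions = [["dog", "frisbee", "car"], ["cat"]]
truth = [{"dog", "frisbee"}, {"cat", "sofa"}]
chair_i, chair_s = chair(mentions, truth)
```

Here one of four mentioned objects ("car") is absent from the annotations, and one of the two captions contains a hallucinated object.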

ChartQA

A benchmark for evaluating chart understanding, requiring both data extraction and reasoning (arithmetic, comparison) over chart images.

CheXpert

A large-scale chest X-ray dataset with 224,316 images annotated for 14 pathological observations, widely used for evaluating medical imaging foundation models.

CIDEr (Consensus-based Image Description Evaluation)

A captioning metric that measures consensus between generated and reference captions using TF-IDF weighted n-gram similarity, designed specifically for image description evaluation.

CircularEval

An evaluation protocol that feeds the same multiple-choice question multiple times with shuffled answer positions, requiring the model to answer correctly in all permutations to count as correct.
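
A sketch of the protocol, assuming the permutations are circular shifts of the option list and that `ask_model` is a hypothetical callable returning the index of the chosen option. The two toy "models" below are invented to show why the protocol filters out position bias.

```python
def circular_eval(question, options, correct_index, ask_model):
    """Count a question as correct only if the model is right under every rotation."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        target = (correct_index - shift) % n  # where the right answer moved to
        if ask_model(question, rotated) != target:
            return False
    return True

options = ["cat", "dog", "bird", "fish"]

def always_third(question, opts):
    return 2  # position-biased: always picks the third option

def knows_answer(question, opts):
    return opts.index("bird")  # picks by content, wherever it sits

biased_ok = circular_eval("Which animal flies?", options, 2, always_third)
robust_ok = circular_eval("Which animal flies?", options, 2, knows_answer)
```

The position-biased model is right in the original ordering but fails once the options rotate, so it gets no credit; the content-based model passes every rotation.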

CLAP (Contrastive Language-Audio Pretraining)

A contrastive learning framework that aligns audio and text representations in a shared embedding space, analogous to CLIP for images and text.

Classifier-Free Guidance (CFG)

An inference technique that amplifies the influence of text conditioning by extrapolating between conditional and unconditional model predictions.
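
A minimal sketch of the guidance rule as it is commonly written, assuming the two predictions are plain lists of floats. The function and variable names are illustrative only.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(cfg_combine(uncond, cond, 1.0))  # [1.0, 3.0], identical to the conditional prediction
print(cfg_combine(uncond, cond, 2.0))  # [2.0, 5.0], extrapolated beyond it
```

A scale of 1 reproduces the conditional prediction; larger scales strengthen the conditioning signal.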

CLEVRER

CoLlision Events for Video REpresentation and Reasoning: a video benchmark testing causal reasoning about physical object interactions through descriptive, explanatory, predictive, and counterfactual questions.

CLIP (Contrastive Language-Image Pre-training)

A model trained to align images and text in a shared embedding space using contrastive learning, widely used for image-text similarity and zero-shot classification.

CLIP Score

A metric measuring the alignment between generated visual content and text descriptions using OpenAI's CLIP model, where higher scores indicate better text-visual correspondence.

Co-Speech Gestures

Hand, body, and facial movements that naturally accompany speech, including rhythmic beat gestures aligned with prosody and semantic gestures that illustrate meaning.

CoBSAT

A benchmark testing compositional binding and spatial attribute reasoning in text-to-image in-context learning, measuring how well models infer and apply visual patterns from examples.

COCO (Common Objects in Context)

A large-scale benchmark for object detection, segmentation, and captioning containing 330K images with 80 object categories, used to evaluate visual encoders on dense prediction tasks.

Cognitive Gap

The disconnect in unified models between understanding visual instructions well and translating that understanding into effective generation-friendly representations.

Cold Start

An initialization phase where the model is fine-tuned on high-quality chain-of-thought data before RL training begins, establishing basic reasoning format and capabilities.

ColPali

A visual document retrieval model built on the PaliGemma vision-language model that encodes document page images as multi-vector embeddings and scores them against queries with a ColBERT-style late interaction mechanism.

Concept Erasure

Safety techniques that remove specific concepts (e.g., violence, NSFW) from a model's capabilities while preserving its general generation ability.

Conformal Prediction

A statistical framework that provides calibrated prediction intervals with guaranteed coverage probabilities, used here to set failure detection thresholds with controlled false alarm rates.
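
One common way to obtain such a threshold is the split conformal recipe, sketched below under the assumption that a held-out set of calibration scores is available (for example, anomaly scores from runs known to be normal). The function name and the choice of score are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Threshold that a new exchangeable score exceeds with probability at most alpha."""
    n = len(calibration_scores)
    # Finite-sample correction: use the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


scores = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3, 0.5]  # hypothetical calibration scores
threshold = conformal_threshold(scores, alpha=0.25)
print(threshold)  # flag a failure when a new score exceeds this value
```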

Consistency Distillation

A technique that trains a student model to produce outputs consistent with a multi-step teacher model in fewer steps, enabling fast inference while aiming to preserve quality.

Contrastive Learning

A training approach that learns representations by pulling similar pairs together and pushing dissimilar pairs apart in embedding space, foundational to models like CLIP.

Cosmos Tokenizer

NVIDIA's pre-trained vision tokenizer that compresses high-dimensional visual observations into compact latent representations for efficient processing.

CoT (Chain-of-Thought)

A prompting and training technique where models generate intermediate reasoning steps before producing a final answer, improving performance on complex multi-step tasks.

Cross-Modal Transfer

The process of adapting a model trained on one modality (e.g., RGB images) to process a different modality (e.g., depth maps, thermal images) by aligning their feature representations.

CUA (Computer-Use Agent)

An AI agent that operates desktop or mobile computer interfaces autonomously by interpreting screen content and executing mouse/keyboard actions.

cVAE (Conditional Variational Autoencoder)

A generative model that encodes data into a latent space conditioned on some input and decodes samples from that space, commonly used as a baseline for conditional generation tasks.

DAgger (Dataset Aggregation)

An imitation learning algorithm that iteratively collects new training data by having the learner act in the environment and querying the expert for corrections, addressing covariate shift.

Data Contamination

When a model's training data includes test benchmark examples, leading to artificially inflated evaluation scores that do not reflect genuine capability.

Data Poisoning

An attack that manipulates a model's training data to inject biases, backdoors, or vulnerabilities that persist after training.

DCScore

A detailed captioning evaluation metric from DeCapBench that decomposes captions into primitive information units and evaluates precision and recall individually, achieving 0.90 Spearman correlation with VLM Arena Elo ratings.

DDPM (Denoising Diffusion Probabilistic Model)

A generative model that learns to reverse a gradual noise-adding process, generating high-quality samples through iterative denoising steps from random noise.

DeiT (Data-efficient Image Transformer)

A family of Vision Transformers trained with knowledge distillation and data augmentation strategies to achieve competitive performance without requiring very large-scale pretraining datasets.

DEM (Digital Elevation Model)

A 3D representation of terrain surface created from elevation data, commonly used alongside optical and SAR imagery for geospatial analysis.

Denoising Steps

The iterative process in diffusion and flow models where noise is progressively removed from a sample, with each step refining the output closer to the final generated result.

DFS (Depth-First Search)

A graph traversal algorithm that explores as far as possible along each branch before backtracking; used in GUI-DFS for systematic environment exploration.

DGRPO (Difficulty-aware GRPO)

A variant of GRPO that scales rewards based on task and sample difficulty, preventing easier tasks from dominating the training gradient.

Dice Score (DSC)

A spatial overlap metric for segmentation tasks measuring the agreement between predicted and ground-truth masks, ranging from 0 (no overlap) to 1 (perfect).
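
For binary masks the score is twice the intersection divided by the sum of the two mask sizes. A minimal sketch, assuming masks are flattened lists of 0/1 values; returning 1.0 for two empty masks is a convention chosen here, not a universal rule.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```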

Diffusion Model

A generative model that learns to reverse a gradual noise-addition process, iteratively denoising random noise into structured outputs like images, 3D shapes, or molecular structures.

Diffusion Policy

A control policy that generates action trajectories through iterative denoising, producing smooth, multi-modal action distributions suitable for robotic manipulation and navigation.

Diffusion Transformer (DiT)

A transformer-based architecture for diffusion models that replaces the traditional U-Net backbone, enabling better parameter scalability and attention-based spatial-temporal reasoning.

DINO (Self-Distillation with No Labels)

A self-supervised vision transformer training method that produces semantically rich visual features without labels, often used for fine-grained visual understanding.

DINOv2

A self-supervised vision foundation model from Meta that produces high-quality patch-level visual features useful for downstream tasks without fine-tuning.

DiT (Diffusion Transformer)

A diffusion model that uses a transformer backbone instead of a U-Net for the denoising network, enabling better scalability and attention-based processing.

DocVQA

A benchmark for document visual question answering that tests reading and reasoning over text, tables, and layouts within document images.

DocVQA (Document Visual Question Answering)

A benchmark and task requiring models to answer questions about document images by reading and understanding text, layout, and visual elements.

DOM (Document Object Model)

A tree-structured representation of a web page's elements, providing programmatic access to page content and structure.

DPA (Dynamic Proportional Accuracy)

A reward function that gives partial credit for partially correct multiple-choice answers, providing denser feedback than binary (correct/incorrect) rewards during RL training.

DPO (Direct Preference Optimization)

A training method that directly optimizes a model's policy from preference pairs (preferred vs. rejected responses) without requiring a separate reward model.
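
The per-pair loss in its commonly cited form can be sketched as below. The inputs are assumed to be summed log-probabilities of each response under the trained policy and a frozen reference model; the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled preference margin for one pair."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # policy already favors the chosen response
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # policy favors the rejected one: larger loss
```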

DreamBench++

A benchmark evaluating text-to-image generation quality in terms of compositional consistency and subject fidelity in complex prompts.

DreamBooth

A personalization method that fine-tunes the entire diffusion model on a few subject images with a unique identifier token, serving as a common baseline for identity preservation.

Dynamic Resolution

Processing images at their original resolution and aspect ratio rather than resizing to a fixed grid, preserving fine-grained visual details.

ECoT (Embodied Chain-of-Thought)

A reasoning approach where the VLA model generates step-by-step plans, sub-tasks, and spatial grounding annotations before predicting physical actions.

ECR (Embedding-Centric Reasoning)

An intermediate reasoning trace generated by a language model specifically designed to enrich the subsequent embedding, bridging generative reasoning and representation learning.

ECTF (Efficient Complete Teacher Forcing)

A training technique that decouples clean history from noisy targets via masking, reducing training complexity from quadratic to linear for interleaved multi-modal sequences.

EditReward-Bench

A benchmark for evaluating reward model accuracy on image editing tasks by measuring alignment with human expert judgments across semantic consistency and perceptual quality.

Egocentric Video

Video captured from a first-person perspective (e.g., head-mounted camera), requiring inference of the camera wearer's actions and intentions that are not directly visible.

ELBO (Evidence Lower Bound)

A tractable lower bound on the log-likelihood of data used to train variational models; in diffusion models, maximizing the ELBO reduces, under the standard noise-prediction parameterization, to minimizing a weighted squared error between predicted and actual noise.

Elo Rating

A rating system originally designed for chess that estimates relative skill levels based on pairwise win/loss outcomes, adapted by CapArena for ranking captioning models.
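
A minimal sketch of the classic update rule. The 400-point scale and K-factor of 32 are the conventional chess values and are assumptions here; the settings used by CapArena are not stated in this glossary.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison; score_a is 1 (win), 0.5 (tie), or 0 (loss)."""
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * (expected_a - score_a)
    return new_a, new_b


print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```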

EMA (Exponential Moving Average)

A smoothing technique that maintains a slowly updating average of model parameters or statistics, used in GRPO-CARE for adaptive reward normalization.
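
The update itself is a one-line recurrence. A minimal sketch, with a hypothetical function name and decay value:

```python
def ema_update(average, value, decay=0.9):
    """Blend the previous average with a new observation."""
    return decay * average + (1.0 - decay) * value


running = 0.0
for observation in [1.0, 1.0, 1.0]:
    running = ema_update(running, observation)
print(running)
```

A decay close to 1 makes the average change slowly, which smooths out noisy statistics at the cost of reacting more slowly to real shifts.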

EMA-GRPO

An RL algorithm that maintains exponential moving averages of task-specific reward statistics, enabling adaptive reward normalization across diverse training tasks.

Embodied AI

AI systems that perceive the world through sensors and act upon it through actuators, requiring integration of perception, reasoning, and physical action in real or simulated environments.

End-to-End (E2E) Driving

An approach where a single model maps raw sensor inputs directly to vehicle control outputs, eliminating hand-crafted interfaces between perception, prediction, and planning modules.

Endmember

In hyperspectral unmixing, a pure material spectrum (e.g., water, vegetation, road) that combines in varying proportions to form the mixed spectrum of each pixel.

Entropy Collapse

A failure mode in RL training where the model's output distribution becomes too narrow, prematurely converging to a small set of behaviors and stopping exploration.

EPDMS (Extended Predictive Driver Model Score)

A composite driving evaluation metric used in the NAVSIM benchmark that measures safety, comfort, traffic rule compliance, and progress in navigation scenarios.

Event Bottleneck

The phenomenon where interleaved image generation quality collapses based on the number of discrete visual events (approximately 20 images) rather than total token count.

Factuality Hallucination

Model output that contradicts verifiable world knowledge (e.g., stating an incorrect historical fact) even if not directly testable from the input alone.

FAD (Fréchet Audio Distance)

The audio equivalent of FID, measuring the quality of generated audio by comparing feature distributions to real audio; lower scores indicate better quality.

Faithfulness Hallucination

Model output that contradicts evidence directly present in the input (e.g., describing a red car as blue when the image clearly shows red).

FGD (Fréchet Gesture Distance)

A metric measuring the distributional similarity between generated and real gesture sequences. Lower values indicate more realistic, human-like generated motion.

FGL (Federated Graph Learning)

A distributed learning paradigm where graph-structured data is trained collaboratively across multiple clients without sharing raw data, preserving privacy.

FID (Fréchet Inception Distance)

A metric measuring the statistical similarity between generated and real image distributions, where lower scores indicate higher quality and diversity of generated outputs.

FiLM (Feature-wise Linear Modulation)

A conditioning mechanism that injects external information (like language goals) into neural network features by learning per-channel scaling and bias parameters.
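
The modulation step can be sketched as a per-channel affine transform. In practice the scale and shift values are predicted from the conditioning input by a small network; here they are passed in directly as an assumption to keep the example self-contained.

```python
def film(features, gamma, beta):
    """Apply a per-channel scale (gamma) and shift (beta).

    features: list of channels, each a list of activations.
    gamma, beta: one scalar per channel.
    """
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


features = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two activations each
print(film(features, gamma=[2.0, 0.5], beta=[0.0, 1.0]))  # [[2.0, 4.0], [2.5, 3.0]]
```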

FLOPs (Floating Point Operations)

A measure of computational cost counting the number of floating-point arithmetic operations required for inference, commonly used to compare model efficiency.

Flow Matching

A generative modeling framework that learns continuous-time velocity fields to transport samples from a noise distribution to a target data distribution, offering a stable-to-train alternative to diffusion models.

Flow Matching / Rectified Flow

A generative framework that learns straight-line paths between noise and data distributions, often more efficient than traditional diffusion's curved trajectories.

Forward Kinematics (FK)

Computing the position and orientation of end effectors given joint angles, as opposed to Inverse Kinematics which computes joint angles from desired end-effector poses.
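
As a small worked case, forward kinematics for a planar arm with two revolute joints can be sketched as below. The two-link arm and unit link lengths are illustrative assumptions.

```python
import math


def forward_kinematics_2link(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position (x, y) and orientation for a planar two-link arm.

    Angles are in radians; theta2 is measured relative to the first link.
    """
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y, theta1 + theta2


print(forward_kinematics_2link(0.0, 0.0))  # arm fully stretched along the x-axis
print(forward_kinematics_2link(math.pi / 2, -math.pi / 2))
```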

FOV (Field of View)

The angular extent of the observable environment that a sensor (camera or LiDAR) can capture at any given moment.

FP-SFT (Flexible Progressive Supervised Fine-tuning)

A training curriculum that starts with low-resolution, high-throughput training and progressively moves to high-resolution fine-tuning for image generation quality.

FP8 (8-bit Floating Point)

A compact floating-point format using 8 bits total, offering better dynamic range than INT8 for handling outlier activations in deep learning models.

FPO (Factorized Preference Optimization)

An optimization method that separately optimizes textual quality and temporal grounding accuracy using factorized preference pairs, enabling independent improvement of each capability.

FVD (Fréchet Video Distance)

An extension of FID to video that measures distributional similarity between generated and real video clips, accounting for both per-frame visual quality and temporal coherence.

FVWM (Foundation Veridical World Model)

A proposed class of models that combine the broad generalization of foundation models with truthful (veridical) dynamic modeling via causal reasoning.

GAN (Generative Adversarial Network)

A generative model consisting of a generator and discriminator trained in opposition, where the generator learns to produce realistic samples that fool the discriminator.

GAT (Graph Attention Network)

A type of graph neural network that uses attention mechanisms to weight the importance of neighboring nodes when aggregating information across the graph.

Gaussian Splatting

A real-time 3D rendering technique representing scenes as collections of 3D Gaussian functions that are projected and blended for novel view synthesis.

GCN (Graph Convolutional Network)

A neural network that operates on graph-structured data, propagating and aggregating information along edges to learn node and graph-level representations.

GELU (Gaussian Error Linear Unit)

An activation function commonly used in transformers that produces asymmetric outputs with a long negative tail, creating challenges for symmetric quantization methods.

GenAIBench

A benchmark measuring alignment between generated images and complex compositional text prompts using VQA-based scoring across diverse scenarios.

GenEval

A benchmark that evaluates compositional text-to-image generation across object counting, attribute binding, spatial relationships, and multi-object scenes.

GEO (Generative Engine Optimization)

Strategies for optimizing content to be discoverable and cited by AI-powered generative search engines like ChatGPT or Google AI Overviews.

gIoU (Generalized Intersection over Union)

A metric for evaluating segmentation quality that measures the overlap between predicted and ground-truth masks, with generalization to non-overlapping cases.

GMM (Gaussian Mixture Model)

A probabilistic model that represents a distribution as a mixture of multiple Gaussian components, used in retrieval to dynamically separate relevant from irrelevant results.

GNN (Graph Neural Network)

Neural networks designed to operate on graph-structured data, learning node and edge representations by aggregating information from local neighborhoods.

Grounding DINO

An open-set object detection model that combines the DINO detector (a DETR-style transformer detector, not the self-supervised DINO method) with grounded language pre-training to detect arbitrary objects specified by text prompts.

GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm that normalizes rewards within groups of sampled trajectories rather than using a separate value network, widely used for post-training language and multimodal models.
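
The group normalization step can be sketched as follows, assuming one scalar reward per sampled response to the same prompt. Using the population standard deviation and a small `eps` are choices made for this sketch; implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros: no learning signal
```

When every sample in a group receives the same reward, all advantages are zero, so that group contributes no gradient.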

GUI (Graphical User Interface)

The visual interface of software applications that users interact with through elements like buttons, menus, text fields, and other on-screen components.

GUI Agent

An AI system that interacts with graphical user interfaces by interpreting screenshots and executing actions like clicking, typing, and scrolling to complete user-specified tasks.

Hallucination

When a VLM generates information (objects, attributes, relationships) not present in the visual input, often by defaulting to language model priors rather than visual evidence.

Hash Encoding (Multi-Resolution)

A technique that maps spatial coordinates to learned feature vectors via hash tables at multiple resolutions, enabling compact and efficient neural scene representations.

HD Map (High-Definition Map)

A highly detailed, centimeter-accurate digital map containing lane boundaries, traffic signs, and road topology, used by autonomous vehicles for precise localization and planning.

Hessian

A matrix of second-order partial derivatives used to measure the sensitivity of model output to weight perturbations, guiding which weights or layers are most important to preserve during quantization.

Hessian Matrix

A matrix of second-order partial derivatives measuring the curvature of the loss landscape. In quantization, it estimates how sensitive a model's output is to perturbations in each weight or activation.

HPSv2 (Human Preference Score v2)

A learned reward model trained on large-scale human preference data to predict aesthetic quality and text-image alignment scores for generated visual content.

HPSv2 / HPSv2.1 (Human Preference Score)

A metric trained on human preference data that predicts how well a generated image aligns with human aesthetic and prompt-adherence standards.

Hungarian Algorithm

An optimization method for solving assignment problems that finds the optimal one-to-one matching between two sets, used in models like ShapeLLM for matching 3D tokens with 2D view features.

IDM (Intelligent Driver Model)

A classic car-following model that computes acceleration based on desired velocity, distance to the lead vehicle, and velocity difference, used as a baseline planner.

ImageBind

A unified embedding model from Meta that learns a shared representation space across six modalities (image, text, audio, depth, thermal, IMU) using image-paired data.

ImageNet

A large-scale image classification benchmark with 1.28 million training images across 1,000 categories, widely used to evaluate vision model accuracy and quantization quality.

ImageNet-1K

A large-scale image classification benchmark containing 1.28 million training images across 1,000 categories, widely used as the primary evaluation standard for visual encoder quality.

ImageReward

A reward model that scores generated images based on human feedback, commonly used to fine-tune or evaluate text-to-image models.

Imitation Learning

A training paradigm where a model learns to replicate expert demonstrations, mapping observed states to actions without explicit reward signals.

IMLE (Implicit Maximum Likelihood Estimation)

A generative modeling technique that trains a model to cover all modes of a target distribution using a set-based loss function, avoiding mode collapse.

Importance Ratio

In policy optimization, the ratio of the probability of an action under the new policy versus the old policy, used to constrain policy updates and prevent training instability.
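
One widely used way to apply the ratio is the PPO-style clipped surrogate, sketched below for a single action. The function name and the clipping range of 0.2 are assumptions for illustration.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped objective for one action, given its log-probabilities and advantage."""
    ratio = math.exp(new_logp - old_logp)  # importance ratio
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(0.0, 0.0, 2.0))  # ratio is 1, so the objective equals the advantage
print(clipped_surrogate(1.0, 0.0, 1.0))  # ratio is about 2.72, capped at 1.2
```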

InfoNCE Loss

A contrastive loss function that maximizes a lower bound on the mutual information between positive pairs by contrasting them against negative pairs, using a softmax-based formulation with a temperature scaling parameter.
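
For a single anchor, the loss is a cross-entropy over temperature-scaled similarities. A minimal sketch, assuming the similarities to all candidates (one positive, the rest negatives) are already computed; the default temperature is illustrative.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """Cross-entropy of picking the positive among all candidates for one anchor."""
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]


print(info_nce([0.9, 0.1, 0.2, 0.0], positive_index=0))  # positive stands out: low loss
print(info_nce([0.5, 0.5, 0.5, 0.5], positive_index=0))  # indistinguishable: log(4)
```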

Inpainting

The task of filling in missing or masked regions of an image with plausible content, often guided by surrounding context or text prompts.

INR (Implicit Neural Representation)

A neural network that represents a signal (e.g., image or 3D shape) as a continuous function mapping coordinates to values, enabling resolution-independent processing.

Inverse Dynamics Model (IDM)

A model that infers what action was taken given observations before and after the action, used to provide action-consistency reward signals in world model training.

IoU (Intersection over Union)

A metric for evaluating segmentation quality, calculated as the overlap between predicted and ground-truth masks divided by their union. Higher is better.
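
A minimal sketch for binary masks, assuming they are flattened lists of 0/1 values; returning 1.0 when both masks are empty is a convention chosen here.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union for two flattened binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    union = sum(pred_mask) + sum(true_mask) - intersection
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return intersection / union


print(iou([1, 1, 0, 0], [1, 0, 1, 0]))  # 1 shared pixel out of 3 covered pixels
```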

IQA (Image Quality Assessment)

The task of evaluating the perceptual quality of images, including detecting specific degradation types such as compression artifacts, blur, noise, and color distortion.

IRL (Inverse Reinforcement Learning)

A technique that infers a reward function from expert demonstrations rather than requiring manually specified rewards, used to learn what objectives an expert is optimizing.

ISLR (Isolated Sign Language Recognition)

The task of identifying individual sign language glosses (words or phrases) from video clips, typically focusing on single-sign classification.

ISTS (Irregularly Sampled Time Series)

Time series data where observations arrive at non-uniform intervals across variables, common in healthcare, finance, and IoT applications, requiring special handling of missing data and temporal irregularity.

Jailbreak Attack

An adversarial technique that tricks a safety-aligned model into generating harmful content by bypassing its safety filters, often exploiting the visual modality in VLMs.

JEPA (Joint Embedding Predictive Architecture)

An architecture that predicts representations of future observations in an abstract embedding space rather than predicting raw pixels, proposed by Yann LeCun for efficient world modeling.

JSCC (Joint Source-Channel Coding)

A communication technique that jointly optimizes data compression and error protection, offering graceful quality degradation under varying channel conditions.

KB-VQA (Knowledge-Based Visual Question Answering)

A VQA variant where answering requires external world knowledge beyond what is visible in the image, such as historical facts or scientific concepts.

KDE (Kernel Density Estimation)

A non-parametric statistical method for estimating probability density functions by smoothing observed data points with kernel functions, used as a theoretical bridge for generative models.

KG (Knowledge Graph)

A structured representation of facts as entities (nodes) connected by relationships (edges), used for reasoning, question answering, and information retrieval.

KL Divergence (Kullback-Leibler Divergence)

A statistical measure of how one probability distribution differs from another, used in RL to measure policy changes and in SPIKE-RL to quantify visual surprise.
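
For discrete distributions the divergence is a weighted sum of log-ratios. A minimal sketch, assuming both inputs are probability lists over the same outcomes and that `q` is nonzero wherever `p` is:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # the measure is not symmetric
```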

Knowledge Distillation

A model compression technique where a smaller 'student' model learns to mimic the behavior of a larger 'teacher' model, transferring knowledge without requiring the teacher's full architecture at inference.

Knowledge Graph

A structured representation of entities and their relationships stored as nodes and edges, used to encode domain knowledge for reasoning, retrieval, and recommendation tasks.

KV-Cache (Key-Value Cache)

An optimization technique that stores previously computed attention keys and values to avoid redundant computation during autoregressive or step-wise generation.

LALM (Large Audio-Language Model)

A large language model extended with audio encoders to understand and reason about non-speech sounds, music, and acoustic events alongside text.

LayerNorm (Layer Normalization)

A normalization technique used in transformers that can produce extreme inter-channel variance, creating outlier activations that complicate quantization.

LCMR (Latent Cross-Modality Regularizer)

A technique that forces speech tokens to be geometrically close to their corresponding text transcript tokens in latent space, improving speech-text alignment without full transcription.

LIBERO

A simulation benchmark for evaluating robotic manipulation policies across multiple tasks, objects, and scenes, widely used for measuring VLA model performance.

LiDAR (Light Detection and Ranging)

A remote sensing technology that uses laser pulses to create precise 3D point clouds of the surrounding environment.

Linear Probing

An evaluation protocol where a frozen pretrained backbone is assessed by training only a simple linear classifier on top, measuring the intrinsic quality of the learned feature representations.

LLM-as-a-Judge

An evaluation paradigm where a powerful language model (like GPT-4) is used to score or compare model outputs, serving as a proxy for human evaluation.

Log2 Quantizer

A non-uniform quantization method that maps values to a logarithmic (base-2) scale, allocating finer precision to smaller values. Used for power-law distributions like post-Softmax activations.
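
One simple formulation for values in (0, 1], such as attention probabilities, is sketched below. The 4-bit default, the rounding rule, and the handling of zero are assumptions; published quantizers differ in these details.

```python
import math


def log2_quantize(x, bits=4):
    """Map a value in (0, 1] to an integer level; level k stands for 2 ** -k."""
    max_level = 2 ** bits - 1
    if x <= 0.0:
        return max_level  # zero is sent to the smallest representable value
    level = round(-math.log2(x))
    return max(0, min(max_level, level))


def log2_dequantize(level):
    """Recover the approximate value for a quantized level."""
    return 2.0 ** (-level)


for value in [1.0, 0.3, 0.01]:
    level = log2_quantize(value)
    print(value, level, log2_dequantize(level))
```

Adjacent levels differ by a factor of two, so small values are represented with much finer absolute spacing than values near 1.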

Long-Tail Scenarios

Rare but safety-critical driving situations (e.g., construction zones, unusual pedestrian behavior) that are underrepresented in training data and disproportionately cause system failures.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning (PEFT) technique that injects trainable low-rank matrices alongside frozen transformer weights, enabling efficient fine-tuning by representing each weight update as the product of two smaller matrices.
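
The low-rank update can be sketched with plain nested lists. The helper names are hypothetical, and the `alpha / rank` scaling follows the common convention rather than anything stated in this glossary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(frozen_w, lora_b, lora_a, alpha, rank):
    """Frozen weight plus the scaled low-rank update (alpha / rank) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen_w, delta)]


frozen_w = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 frozen weight
lora_b = [[1.0], [0.0]]              # 2x1 trainable factor
lora_a = [[0.0, 2.0]]                # 1x2 trainable factor
print(lora_effective_weight(frozen_w, lora_b, lora_a, alpha=1.0, rank=1))  # [[1.0, 2.0], [0.0, 1.0]]
```

Only the two small factors are trained; here that is 4 values instead of the 4 in the full matrix, but the saving grows quickly with layer width because the factors scale with width times rank rather than width squared.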

LPIPS (Learned Perceptual Image Patch Similarity)

A perceptual similarity metric that measures the distance between images using deep network features, more aligned with human perception than pixel-level metrics.

LVBench

A benchmark for long-form video understanding requiring comprehension of videos averaging over 1 hour, used to evaluate long-context video reasoning.

M-RoPE (Multimodal Rotary Position Embedding)

A positional encoding that decomposes into three components (time, height, width), enabling a unified coordinate system for both images and videos within a single model.

Macro-AUC

Area Under the ROC Curve averaged equally across all classes, measuring classification performance while giving equal weight to both rare and common conditions.

MAE (Masked Autoencoder)

A self-supervised pretraining method that randomly masks portions of the input (image patches or tokens) and trains the model to reconstruct the missing parts.

Mamba

A state-space model architecture that processes sequences with linear complexity (vs. quadratic for transformers), enabling efficient handling of long sequences in vision and language tasks.

mAP (Mean Average Precision)

A metric that averages per-query or per-class average precision (the area under the precision-recall curve), often additionally averaged over several IoU thresholds; commonly used in temporal grounding and moment retrieval tasks.

MAS (Multi-Agent System)

An AI architecture where multiple specialized agents collaborate through structured communication to solve complex tasks, each agent handling a specific role or domain.

MathVision

A competition-level visual mathematics benchmark containing problems that require advanced geometric and algebraic reasoning from mathematical diagrams and figures.

MathVista

A benchmark for evaluating multimodal mathematical reasoning, containing problems that require interpreting visual diagrams, charts, and figures alongside textual mathematical questions.

Matryoshka Representation Learning

A training approach that creates nested representations of varying dimensions, enabling a single model to operate at multiple computational budgets by using subsets of the full representation.

MCoT (Multimodal Chain-of-Thought)

Extension of chain-of-thought reasoning to multimodal contexts, where intermediate reasoning steps may involve images, audio, or 3D data.

MCP (Model Context Protocol)

A standardized communication protocol for connecting AI models to external tools and applications.

MCQ (Multiple-Choice Question)

A question format with predefined answer options, commonly used in benchmarks but vulnerable to guessing and positional bias.

MCTS (Monte Carlo Tree Search)

A search algorithm that explores decision trees by running random simulations to estimate the value of each branch, used here to generate step-level correctness labels automatically.

MDM (Masked Diffusion Model)

A discrete diffusion model that operates by progressively unmasking tokens from a fully masked state, similar to masked language modeling but applied to generation.

MedSAM

An adaptation of SAM fine-tuned on 1.5 million medical image-mask pairs to enable universal segmentation across diverse medical imaging modalities.

MedSAM (Medical Segment Anything Model)

A foundation model for universal medical image segmentation adapted from Meta's SAM, trained on 1.5 million medical image-mask pairs across 10 imaging modalities.

MHSA (Multi-Head Self-Attention)

The core mechanism of Transformers where input tokens attend to each other through multiple parallel attention 'heads', each learning different relationship patterns across the sequence.

MIL (Multiple Instance Learning)

A weakly supervised learning paradigm where labels are assigned to 'bags' of instances (e.g., a slide containing many patches), and the model must learn which instances contribute to the bag-level label.

MIMIC-CXR

A large publicly available dataset of chest X-rays paired with radiology reports from Beth Israel Deaconess Medical Center, widely used as a benchmark for medical report generation.

MIMO (Multiple-Input Multiple-Output)

An antenna technology using multiple antennas at both transmitter and receiver to improve wireless communication performance.

mIoU (Mean Intersection over Union)

The average IoU across all classes in a semantic segmentation task, providing a single metric for overall segmentation quality.

MIRCs (Minimal Recognizable Configurations)

The smallest spatial or spatiotemporal regions in a video that are sufficient for human recognition of an action, used to benchmark AI robustness against humans.

Mixed-Precision Quantization

Assigning different bit-widths to different layers based on their sensitivity, allowing critical layers to maintain higher precision while less sensitive layers are compressed more aggressively.

Mixture-of-Experts (MoE)

An architecture where multiple specialized sub-networks (experts) process different inputs, selected by a learned router. Only a subset of experts activate per input, maintaining computational efficiency.

MLD (Multi-Modal Latent Diffusion)

A framework using independent deterministic autoencoders per modality and a shared score-based diffusion model to learn joint multi-modal distributions without information loss.

MLLM (Multi-Modal Large Language Model)

A large language model extended to process multiple input modalities (images, point clouds, text) alongside language for reasoning and generation.

MLLM (Multimodal Large Language Model)

A large language model augmented with the ability to process non-text modalities (images, audio, video) alongside text, enabling unified reasoning across diverse input types.

MLLM / LMM (Multimodal Large Language Model / Large Multimodal Model)

Large-scale models that can process and generate content across multiple modalities (text, images, video, audio), extending LLM capabilities to visual inputs.

MLLM / MLLMs (Multimodal Large Language Models)

Large language models augmented with visual encoders to process and reason over both text and images, often combining architectures like CLIP with language models like Vicuna or Qwen.

MLRM (Multimodal Large Reasoning Model)

An enhanced VLM equipped with explicit chain-of-thought reasoning capabilities, enabling step-by-step logical inference over multimodal inputs.

MM-DiT (Multi-Modal Diffusion Transformer)

A diffusion model architecture that uses a unified attention mechanism concatenating text and image tokens, replacing the separate cross-attention used in U-Net architectures.

MM-DiT (Multimodal Diffusion Transformer)

A diffusion model architecture that processes text and image tokens jointly through a unified bidirectional attention mechanism, replacing the separate cross-attention of older UNet-based models.

MM-LLM (Multimodal Large Language Model)

A large language model extended to accept and/or generate content across multiple modalities such as text, images, audio, and video.

MM-PTM (Multi-Modal Pre-Trained Model)

Models pre-trained on multiple modalities (text, images, audio, video) simultaneously, learning cross-modal representations from large-scale paired datasets.

mm-Wave (Millimeter Wave)

Radio frequencies in the 24-100 GHz range used for high-bandwidth 5G communications, requiring precise beam alignment due to narrow beamwidths.

MMAU (Massive Multi-Task Audio Understanding)

A benchmark of 10,000 expert-annotated audio clips across speech, music, and environmental sounds, testing 27 distinct expert skills requiring domain knowledge and complex reasoning.

MMAU (Multimodal Audio Understanding)

A benchmark evaluating audio-language models on complex audio question answering across sound, music, and speech categories.

MMDiT (Multimodal Diffusion Transformer)

A diffusion transformer architecture designed to process multiple modalities (e.g., video and audio streams) in parallel with cross-modal attention layers for synchronized joint generation.

MME

A comprehensive evaluation benchmark measuring both perception and cognition capabilities of multimodal models across 14 subtasks.

MME Perception

A benchmark evaluating multimodal large language models on perception-related tasks including existence recognition, count estimation, position understanding, color identification, and OCR.

MME-3DR

A benchmark for evaluating text-to-3D generation quality, measuring how well generated 3D assets match text descriptions across diverse object categories including stylized representations.

MMEB (Massive Multimodal Embedding Benchmark)

A benchmark for evaluating multimodal embedding models across diverse tasks including retrieval, classification, and compositional reasoning with varying instruction complexity.

MMHal-Bench

A benchmark specifically designed to evaluate hallucination in multimodal models across 8 task types and 12 object categories, penalizing responses containing ungrounded information.

MMKG (Multi-Modal Knowledge Graph)

A knowledge graph that represents entities and their relationships using multiple modalities including text descriptions, images, and structured relational data.

MMLU (Massive Multitask Language Understanding)

A benchmark measuring knowledge and reasoning across 57 academic subjects from elementary to professional level.

MMMU

Massive Multi-discipline Multimodal Understanding benchmark — 11.5K expert-level questions from college exams across 30 subjects, testing advanced multimodal reasoning.

MMMU (Massive Multi-discipline Multimodal Understanding)

A benchmark testing college-level multimodal reasoning across 30 subjects and 183 subfields, requiring both visual understanding and domain expertise.

MMStar

A carefully curated VLM benchmark that filters out text-solvable questions and leaked samples to ensure genuine multimodal understanding is required.

mmWave (Millimeter Wave) Radar

A sensing technology using radio waves in the 30–300 GHz band to detect objects and motion; used for privacy-preserving human pose estimation as it does not capture identifiable images.

Modality Degradation

The phenomenon where fine-tuning a language model on visual instruction data causes its original text-only instruction-following capabilities to significantly decline.

Modality Gap

The separation between image and text embeddings in a shared representation space, which can weaken safety alignment when visual inputs shift the model's internal state away from its text-trained safety boundaries.

ModelNet40

A benchmark dataset containing 12,311 CAD models across 40 categories for evaluating 3D object classification methods.

MoE (Mixture of Experts)

An architecture that routes different inputs to specialized sub-networks (experts) via a gating mechanism, allowing the model to scale capacity without proportionally increasing compute per input.

MoE (Mixture-of-Experts)

An architecture where multiple specialized sub-networks (experts) are activated selectively based on input characteristics, improving capacity without proportionally increasing computation.

MosIT (Modality-switching Instruction Tuning)

A training strategy introduced in NExT-GPT that teaches multimodal models to handle complex cross-modal understanding and generation through modality-switching instructions.

MOTA (Multiple Object Tracking Accuracy)

A comprehensive tracking metric that combines false positives, missed detections, and identity switches into a single accuracy score.
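
The commonly used form of the score is one minus the total error count divided by the number of ground-truth objects, summed over all frames. The sketch below assumes those counts have already been accumulated; the function and argument names are illustrative.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# The score can be negative when errors outnumber ground-truth objects.
print(mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100))
```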

MPC (Model Predictive Control)

A control strategy that uses a model of the system to predict future states and optimize actions over a receding time horizon.

MPJPE (Mean Per Joint Position Error)

A standard metric for pose estimation measuring the average Euclidean distance between predicted and ground-truth joint positions in millimeters.
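
A small sketch of the computation, assuming both poses are given as lists of (x, y, z) joint coordinates in millimeters and in the same joint order. The name `mpjpe` is chosen for the example.

```python
import math


def mpjpe(pred, gt):
    """Average Euclidean distance between corresponding joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # one joint is exact, the other is 5 mm off
```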

MPO (Mixed Preference Optimization)

A training method that uses preference data from a reward model to optimize a downstream VLM, mixing preference signals for improved reasoning.

MTP (Multi-Token Prediction)

A decoding strategy where the model predicts multiple output tokens per step rather than one, significantly improving inference throughput.

MVBench

A comprehensive benchmark with 20 temporal video understanding tasks (e.g., action sequence, moving direction, scene transition) designed to evaluate dynamic temporal reasoning in MLLMs.

MVoT (Multimodal Visualization-of-Thought)

A reasoning paradigm where models generate interleaved image tokens representing intermediate environmental states during spatial reasoning.

NaViT (Native Resolution Vision Transformer)

A vision transformer architecture that processes images at their native resolution using variable-length token sequences, avoiding information loss from fixed-size resizing.

NAVSIM

A large-scale benchmark for evaluating autonomous driving planning models using real-world driving logs with standardized evaluation metrics.

NDCG (Normalized Discounted Cumulative Gain)

An information retrieval metric that measures ranking quality, giving higher weight to relevant results appearing earlier in the ranked list.
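
The position discount can be sketched as follows. This version uses linear gain (the relevance grade itself) and a log2 discount; an exponential-gain variant also exists, so treat the exact formula here as one common choice rather than the only definition.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # already ideally ordered
print(ndcg([1, 2, 3]))  # the most relevant item is ranked last, so the score drops
```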

NDS (nuScenes Detection Score)

A composite metric for 3D object detection on the nuScenes benchmark, combining mAP with measures of localization, size, orientation, velocity, and attribute accuracy.

NeRF (Neural Radiance Field)

A neural network that represents a 3D scene as a continuous volumetric function, enabling photorealistic novel-view synthesis from a set of input images.

nSFT (Negative Supervised Fine-Tuning)

A method that incorporates logit information from rejected responses into SFT, capturing the key benefit of RLHF without requiring full reinforcement learning training.

nuScenes

A large-scale autonomous driving dataset featuring multi-modal sensor data (cameras, LiDAR, radar) from 1,000 urban driving scenes with 3D annotations.

Occupancy Forecasting

Predicting which 3D voxels or grid cells in a scene will be occupied at future time steps, used in driving to anticipate where vehicles and obstacles will be.

Occupancy Prediction

The task of predicting which 3D voxels in the environment are occupied by objects, providing a dense geometric representation for safe navigation.

OCR (Optical Character Recognition)

The technology for converting images of text (from scanned documents, photos, or PDFs) into machine-readable text.

ODE (Ordinary Differential Equation)

A deterministic formulation of the generation process that follows a fixed path from noise to image; sampling is faster than with an SDE, but the absence of randomness prevents the stochastic exploration needed for RL.

OmniDocBench

A comprehensive benchmark for evaluating general document parsing quality across diverse document types including text, tables, formulas, charts, and mixed layouts.

OOD (Out-of-Distribution)

Data or scenarios that differ significantly from the training distribution, often causing model performance degradation due to unfamiliar patterns.

OOD Detection (Out-of-Distribution Detection)

The task of identifying test samples that come from a different distribution than the training data, crucial for safe deployment of machine learning models.

OpenCompass

A comprehensive evaluation platform for large language and multimodal models that aggregates scores across diverse benchmarks for standardized comparison.

Optical Flow

A dense 2D vector field describing the apparent motion of pixels between consecutive video frames, representing how each pixel moves over time.

ORM (Outcome Reward Model)

A model that evaluates only the final answer or outcome of a reasoning chain, providing coarse-grained reward without step-level feedback.

OSM (OpenStreetMap)

A collaborative, open-source geographic database providing crowdsourced map data including roads, buildings, and points of interest with precise geolocation.

OSR (Open Set Recognition)

The task of classifying known classes while rejecting unknown classes at test time. VLMs have largely rendered this task obsolete by providing strong zero-shot recognition.

OSWorld

A benchmark for evaluating GUI agents on desktop tasks across real Linux applications (LibreOffice, GIMP, browsers) in a live virtual machine environment.

PA-MPJPE (Procrustes Analysis MPJPE)

A pose estimation metric that first aligns the predicted skeleton to the ground truth using Procrustes analysis (a similarity transformation: translation, rotation, and uniform scale) before computing MPJPE, isolating articulated pose accuracy from global position, orientation, and scale.

PageRank

An algorithm originally designed to rank web pages by importance, adapted in AIM to identify and prune unimportant visual tokens based on attention weight graphs.

Paralinguistic Features

Non-verbal vocal cues including tone, pitch, emotion, speaking rate, and hesitation patterns that convey meaning beyond the literal words spoken.

Patchify Stem

The initial layer of a ViT that divides an input image into non-overlapping patches and projects them into embedding vectors. Larger patch sizes (e.g., 16x16) reduce token count but may lose fine spatial detail.

PCKh@0.5 (Percentage of Correct Keypoints with Head-Normalized Threshold)

A pose estimation metric that counts the fraction of predicted keypoints falling within a distance threshold proportional to the head size; the @0.5 suffix sets that threshold to 50% of the head segment length.

PCS (Personalized Content Synthesis)

Techniques for generating content tailored to specific user-provided examples, balancing subject fidelity (visual resemblance) with text alignment (following the prompt).

PDMS (Predictive Driver Model Score)

A driving quality metric that evaluates trajectory safety and feasibility without requiring full closed-loop simulation, used as a computationally efficient proxy.

PEFT (Parameter-Efficient Fine-Tuning)

A family of methods that adapt large pre-trained models to downstream tasks by updating only a small subset of parameters (typically <5%), preserving pre-trained knowledge while reducing computational cost.

Personalization / Subject-Driven Generation

Techniques that customize a pre-trained text-to-image model to generate images of specific user-provided subjects while following new text prompts.

PhysReason

A 1,200-problem physics reasoning benchmark requiring an average of 8.1 solution steps with step-level evaluation via automated scoring.

PickScore

A human preference prediction metric trained on the Pick-a-Pic dataset, measuring alignment between generated images and human preferences.

Pix2Pix

A conditional GAN architecture for image-to-image translation that uses a U-Net generator and PatchGAN discriminator, widely used for paired image transformation tasks.

Pixel-Space Reasoning

A paradigm where VLMs interleave text generation with active visual operations (zoom-in, crop, frame-select) to inspect fine-grained image details during reasoning.

Point Cloud

A set of 3D data points in space representing the external surface of objects, commonly captured by LiDAR sensors or depth cameras for autonomous driving and robotics.

POMDP (Partially Observable Markov Decision Process)

A decision framework where an agent makes sequential choices with incomplete information about the environment state, used to model document navigation.

POPE (Polling-based Object Probing Evaluation)

A hallucination evaluation benchmark that asks binary 'Is there a [object]?' questions with adversarial negative sampling to test VLM visual grounding.

Post-Training Quantization (PTQ)

A compression technique that reduces model precision (e.g., from 32-bit to 4-bit) after training, enabling faster inference with minimal quality loss.

PoT (Power-of-Two)

Quantization scaling factors restricted to powers of two, enabling efficient implementation via simple bit-shift operations instead of costly floating-point multiplications.

Power-of-Two (PoT) Quantization

A quantization scheme where scaling factors are restricted to powers of two, enabling re-quantization via simple bit-shift operations instead of costly floating-point multiplication.
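
To make the bit-shift point concrete, the sketch below rescales an integer accumulator by 2**-shift using only integer addition and a right shift. The function name and the round-half-up behaviour are assumptions for illustration; hardware kernels may round differently.

```python
def requantize_pot(acc, shift):
    """Multiply an integer accumulator by 2**-shift using integer ops only."""
    rounding = 1 << (shift - 1) if shift > 0 else 0  # adds 0.5 before truncation
    return (acc + rounding) >> shift


print(requantize_pot(100, 3))  # 100 / 8 = 12.5, rounded up to 13
print(requantize_pot(96, 3))   # 96 / 8 = 12 exactly
```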

PPO (Proximal Policy Optimization)

A widely used reinforcement learning algorithm that approximates a trust-region constraint by clipping the policy probability ratio, preventing large destabilizing updates during training.
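
The clipping can be sketched for a single sample as below, where `ratio` is the new-policy probability divided by the old-policy probability of the taken action. The default `eps=0.2` is a commonly used value, assumed here rather than taken from this glossary.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to push the ratio
    # beyond the [1 - eps, 1 + eps] band in the favourable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_objective(1.5, 2.0))   # gain is capped at 1.2 * 2.0
print(ppo_clip_objective(0.5, -1.0))  # capped at 0.8 * -1.0
```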

Prefix Tuning

A parameter-efficient fine-tuning method that prepends learnable continuous vectors to the input of a frozen language model, adapting it to new tasks without modifying its weights.

Privileged Learning / Asymmetric Actor-Critic

A training strategy where the critic or teacher has access to ground-truth state information during training, while the actor or student operates only from limited sensory inputs like camera images.

PRM (Process Reward Model)

A reward model that evaluates the correctness of each intermediate reasoning step rather than only the final answer, providing fine-grained step-level supervision for training.

Process Reward Model (PRM)

A model that evaluates the quality of intermediate steps in a generation process rather than only scoring the final output, enabling more fine-grained guidance during search.

Product of Experts

A probabilistic framework that combines multiple models by multiplying their output distributions (equivalently, summing log-probabilities) and renormalizing, allowing each expert to contribute its specialized knowledge.
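
A minimal sketch for discrete distributions over the same set of outcomes, assuming every probability is strictly positive so that the logarithm is defined. The function name is illustrative.

```python
import math


def product_of_experts(*dists):
    """Combine categorical distributions by summing log-probs, then renormalize."""
    n = len(dists[0])
    log_scores = [sum(math.log(d[i]) for d in dists) for i in range(n)]
    m = max(log_scores)  # shift for numerical stability
    weights = [math.exp(s - m) for s in log_scores]
    z = sum(weights)
    return [w / z for w in weights]


# An uninformative expert leaves the confident expert's opinion unchanged.
print(product_of_experts([0.5, 0.5], [0.9, 0.1]))
```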

PSAS (Physics Solution Auto Scoring)

An automated evaluation framework that maps model outputs to annotated reasoning steps to verify theorem application and calculation independently, achieving >98% agreement with human annotations.

PTQ (Post-Training Quantization)

A compression technique that converts model weights and activations from high-precision (e.g., FP32) to lower-precision formats (e.g., INT8, INT4) using a small calibration dataset, without requiring full retraining.
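
The role of the calibration set can be sketched with symmetric per-tensor quantization, where the scale is derived from the largest magnitude seen during calibration. This is one simple scheme among many, and all function names are assumptions for the example.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the largest calibration magnitude maps to the top integer."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(abs(v) for v in calibration_values) / qmax


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer level and clip to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


scale = calibrate_scale([-1.27, 0.5, 1.0])
q = quantize([1.0, -1.27, 2.0], scale)  # 2.0 lies outside the calibrated range and is clipped
print(q, dequantize(q, scale))
```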

Q-Former

A Querying Transformer module used to bridge vision encoders and language models by learning a fixed number of query tokens that extract relevant visual information.

Q-Former (Querying Transformer)

A lightweight transformer module that learns to extract fixed-length query-based representations from variable-length visual features, used in models like BLIP-2 for efficient vision-language bridging.

Q-MoE (Quality-Driven Mixture-of-Experts)

A routing mechanism that dynamically selects the best output from multiple expert models at each processing step based on perceptual quality metrics.

QAT (Quantization-Aware Training)

A quantization approach that simulates low-precision arithmetic during the full training process, typically achieving higher accuracy than PTQ but requiring significantly more compute.

QLoRA (Quantized Low-Rank Adaptation)

A parameter-efficient fine-tuning method combining 4-bit quantization of base model weights with LoRA adapters to dramatically reduce memory requirements.

QM/MM (Quantum Mechanics/Molecular Mechanics)

A computational chemistry method that treats a small region with quantum mechanics and the surrounding environment with classical force fields.

R1 Regularization

A gradient penalty technique for GANs that penalizes the discriminator's gradient magnitude on real data samples, improving training stability especially for large-scale adversarial post-training.

RadGraph F1

A clinical evaluation metric that measures the accuracy of extracted medical entities and their relationships from radiology reports, more clinically meaningful than standard text overlap metrics.

RAG (Retrieval Augmented Generation)

A technique that enhances language model generation by retrieving relevant external documents from a knowledge base and incorporating them into the model's input context.

RAG (Retrieval-Augmented Generation)

A technique where a model retrieves relevant external information from a knowledge base before generating a response, reducing hallucinations.

ReAct (Reasoning and Acting)

An agent framework where an LLM alternates between generating reasoning traces (thought) and executing actions (tool calls), enabling multi-step problem solving.

Reasoning Tax

The phenomenon where training models for explicit chain-of-thought reasoning inadvertently degrades their safety alignment, making them more vulnerable to jailbreak attacks.

Rectified Flow

A generative modeling framework that learns straight-line transport paths between noise and data distributions, offering more efficient sampling compared to standard diffusion processes.

RefCOCO/RefCOCO+/RefCOCOg

A family of benchmark datasets for referring expression comprehension, where models must localize objects in COCO images given natural language descriptions of varying complexity.

RefCOCOg

A visual grounding benchmark that tests a model's ability to locate specific objects in images based on natural language referring expressions.

Referring Expression Comprehension (REC)

A visual grounding task where the model must identify and localize a specific object in an image based on a natural language description (e.g., 'the red mug next to the laptop').

Reward Hacking

A failure mode where a model learns to exploit patterns in the reward signal (e.g., producing longer or more formatted outputs) rather than genuinely improving quality.

Reward Model

A model trained to score or rank model outputs based on human preferences, providing the reward signal used in RLHF training.

RFT (Reinforcement Fine-Tuning)

A post-training approach that uses reinforcement learning signals (rewards) rather than labeled examples to fine-tune a pre-trained model.

RIS (Reconfigurable Intelligent Surface)

A planar surface composed of programmable elements that can dynamically manipulate electromagnetic waves to improve wireless communication coverage.

RL (Reinforcement Learning)

A machine learning paradigm where an agent learns to make decisions by receiving reward or penalty signals from an environment, optimizing cumulative long-term reward.

RLHF (Reinforcement Learning from Human Feedback)

A training paradigm where human preference judgments are used as reward signals to fine-tune generative models, aligning outputs with human intent and quality expectations.

RLVR (Reinforcement Learning from Verifiable Rewards)

An RL approach using rewards that can be automatically verified (e.g., spatial overlap, trajectory shape) rather than human judgments.

RLVR (Reinforcement Learning with Verifiable Reward)

A reinforcement learning approach where rewards are computed from verifiable criteria (e.g., correct timestamps, spatial accuracy) rather than learned reward models.

RLVR (Reinforcement Learning with Verifiable Rewards)

A post-training paradigm where models are optimized using rewards that can be automatically verified (e.g., correct math answer) rather than requiring human judgment.

RoPE (Rotary Positional Embedding)

A positional encoding method that uses rotation matrices to encode position information, allowing models to generalize to different sequence lengths and spatial dimensions.
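
The rotation can be sketched in one dimension of position as below. This version rotates adjacent feature pairs and uses a base of 10000; some implementations pair features differently, so the layout here is an assumption.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent feature pair by a position-dependent angle."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# The attention score depends only on the relative offset (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 2), rope(k, 0)))
```

Because rotations preserve dot products up to the angle difference, the query-key score is a function of relative position only.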

RoPE (Rotary Positional Embeddings)

A positional encoding method that encodes position information through rotation of feature vectors, supporting variable sequence lengths and, in 2D form, variable image resolutions.

RS (Remote Sensing)

The science of acquiring information about the Earth's surface from satellite, aerial, or drone-based sensors without direct physical contact.

RSSM (Recurrent State-Space Model)

A generative model combining deterministic and stochastic components to learn latent dynamics from sequential data, widely used in model-based reinforcement learning.

RSVQA

Remote Sensing Visual Question Answering — a task requiring models to answer natural language questions about satellite or aerial images.

Safety Alignment

The process of training or configuring a model to refuse harmful requests, avoid generating dangerous content, and behave in accordance with human values and safety policies.

SAM (Segment Anything Model)

A foundation model for image segmentation developed by Meta, trained on over 1 billion masks, that can segment any object in an image given point, box, or mask prompts (text prompting was explored only as a proof of concept in the original release).

SAR (Synthetic Aperture Radar)

An active imaging sensor that emits microwave pulses to create images regardless of weather or lighting conditions, producing images that look very different from optical photos.

ScanObjectNN

A benchmark dataset of real-world 3D scanned objects with background clutter, used to evaluate point cloud classification models in realistic conditions.

ScanRefer

A benchmark for 3D visual grounding that requires localizing objects in indoor 3D scenes (ScanNet) based on natural language descriptions, evaluated by accuracy at specific IoU thresholds.

Scene Graph

A structured representation of an image as a graph where nodes represent objects and edges represent relationships between them, used in CompreCap for hierarchical caption evaluation.

ScreenSpot-Pro

A professional-level benchmark for GUI grounding that tests a model's ability to precisely locate interface elements from natural language instructions across diverse software applications.

SDE (Stochastic Differential Equation)

A mathematical formulation of diffusion that includes random noise at each step, enabling stochastic exploration of generation paths (needed for RL).

SDS (Score Distillation Sampling)

A technique that distills knowledge from a pre-trained diffusion model's score function into another model or representation, enabling optimization without full sampling.

SEER (Self-Evolving Evaluator and Reprompter)

A mechanism in Endogenous Reprompting that trains a model to first be a verifiable evaluator (via RLVR), then uses that evaluator to train the model to rewrite generation prompts (via RLMT).

Set-of-Mark (SoM)

A visual prompting technique that overlays numbered markers on UI elements or scene objects, allowing vision-language models to reference specific locations by ID rather than predicting coordinates.

SFT (Supervised Fine-Tuning)

Training a pre-trained model on task-specific labeled data to adapt it for downstream applications, typically using teacher-forced next-token prediction.

Sharpness-Aware Minimization (SAM)

An optimization technique that seeks parameter regions where the loss is uniformly low (flat minima), improving generalization by avoiding sharp, brittle minima.

SigLIP

Sigmoid Loss for Language-Image Pre-training: a contrastive learning method that trains vision-language models using a sigmoid loss on image-text pairs, showing superior localization performance compared to classification-pretrained encoders.

SigLIP (Sigmoid Loss for Language Image Pre-training)

A contrastive learning variant that uses sigmoid loss instead of softmax, enabling effective training without requiring massive batch sizes for negative mining.

SigLIP (Sigmoid Loss for Language-Image Pre-training)

An improved contrastive vision-language pretraining method using sigmoid loss instead of softmax, showing superior localization performance and becoming the standard backbone for grounding tasks.

SigLIP (Sigmoid Loss for Language-Image Pretraining)

A vision-language pretraining approach that uses a sigmoid loss function instead of softmax contrastive loss, enabling more efficient batch processing.

Sim-to-Real Transfer

The process of deploying policies trained in simulation to real-world robots, often challenged by visual and dynamic mismatches between the two domains.

SimplerEnv

A simplified simulation environment for evaluating robotic manipulation policies, testing visual matching and precision control capabilities.

SLAM (Simultaneous Localization and Mapping)

The computational problem of constructing a map of an unknown environment while simultaneously tracking the agent's location within it.

SMPL-X / FLAME

Parametric 3D body (SMPL-X) and face (FLAME) models that represent human shape and pose as compact parameter vectors, widely used as output formats in animation generation.

SO(3)

The special orthogonal group in 3 dimensions — the mathematical group of all 3D rotations, used to represent orientations in generative models for molecular and 3D structure design.

SOC (Stochastic Optimal Control)

A mathematical framework for finding optimal control policies in systems with random dynamics, applied here to steer diffusion generation toward target styles.

Softmax

A function that converts raw attention scores into a probability distribution; its outputs follow a power-law distribution with many near-zero values and few large values, making quantization difficult.

Spherical Harmonics

Mathematical functions defined on the surface of a sphere, used in 3D graphics to efficiently represent directional lighting and view-dependent appearance effects such as specular reflections.

SPL (Success weighted by Path Length)

A navigation metric that rewards successful episodes proportionally to their path efficiency relative to the optimal path, penalizing unnecessary wandering.
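
A short sketch of the usual formula, in which each successful episode contributes the ratio of the shortest path length to the longer of the shortest and actual path lengths. The tuple layout is an assumption, and path lengths are assumed to be positive.

```python
def spl(episodes):
    """episodes: list of (success, shortest_path_length, agent_path_length)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


episodes = [
    (True, 10.0, 10.0),   # optimal path, full credit
    (True, 10.0, 20.0),   # success but twice as long, half credit
    (False, 10.0, 5.0),   # failure, no credit
]
print(spl(episodes))
```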

SQNR (Signal-to-Quantization-Noise Ratio)

A metric measuring how much a layer's output is corrupted by quantization noise, used as a proxy for layer sensitivity in mixed-precision quantization.
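
One common way to express the ratio is in decibels, as signal power over the power of the quantization error. The sketch below assumes the original and quantized outputs are available as flat lists; the function name is illustrative.

```python
import math


def sqnr_db(original, quantized):
    """10 * log10(signal power / quantization-noise power)."""
    signal = sum(x * x for x in original)
    noise = sum((x - q) ** 2 for x, q in zip(original, quantized))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)


# Lower values suggest a more sensitive layer that may need more bits.
print(sqnr_db([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9]))
```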

STN (Spatial Transformer Network)

A neural network module that learns to spatially transform feature maps, enabling the model to focus on relevant regions and become robust to geometric variations.

STSG (Spatial-Temporal Scene Graph)

A structured representation of a video scene as a graph where nodes represent objects and edges represent spatial and temporal relationships between them across frames.

SVD (Singular Value Decomposition)

A matrix factorization technique that decomposes a matrix into the product of three matrices (U, Σ, and the transpose of V), used in PEFT methods to identify and adapt the most important directions in weight space.

Swap Attack

An adversarial technique where authentic watermarked video is paired with malicious deepfake audio, exploiting watermarking systems that treat modalities independently.

Swin Transformer

A hierarchical Vision Transformer that computes attention within shifted local windows, enabling efficient processing of high-resolution images for dense prediction tasks like detection and segmentation.

Sycophancy

A behavior in which a model prioritizes agreeing with or pleasing the user over providing accurate or truthful responses, even when the user's statements contradict visual evidence.

T-GRPO (Temporal Group Relative Policy Optimization)

An extension of GRPO that adds contrastive temporal rewards, comparing model answers on ordered versus shuffled video frames to enforce temporal understanding.

T2I-CompBench

A comprehensive benchmark for fine-grained compositional text-to-image generation, evaluating color, shape, texture, spatial, and complex attribute binding.

TD Learning (Temporal Difference Learning)

An RL technique where value estimates are updated based on the difference between consecutive predictions, used in VisVM to train caption quality estimators without full rollouts.
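
The update rule can be sketched with tabular TD(0), the simplest variant. This is a generic illustration with made-up state names, not the training procedure of VisVM.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move V(state) toward the bootstrapped target reward + gamma * V(next_state)."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


values = {"a": 0.0, "b": 1.0}
err = td0_update(values, "a", reward=1.0, next_state="b", alpha=0.5)
print(err, values["a"])
```

Because the target uses the current estimate of the next state, no rollout to the end of the episode is needed.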

TEDS (Tree Edit Distance Similarity)

A metric for evaluating table recognition quality by computing the structural similarity between predicted and ground-truth table HTML/Markdown as tree structures.

Temporal IoU (Intersection over Union)

A metric measuring the overlap between predicted and ground-truth time intervals in a video, used as a reward signal for temporal localization tasks.
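
A minimal sketch, assuming each interval is a (start, end) pair in seconds with start no later than end. The function name is chosen for the example.

```python
def temporal_iou(pred, gt):
    """Overlap length divided by the length of the union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # 2 s overlap over a 6 s union
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))  # disjoint intervals
```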

Temporal Video Grounding (TVG)

The task of localizing specific temporal segments (start and end timestamps) in a video that correspond to a natural language query.

Test-Time Scaling

Techniques that allocate additional computation during inference (e.g., generating multiple candidate answers and voting) to improve output quality.
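
The voting example mentioned above can be sketched as follows, assuming the candidate answers have already been sampled from the model and normalized to comparable strings. The helper name is hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent answer among sampled candidates."""
    return Counter(candidates).most_common(1)[0][0]


samples = ["42", "41", "42", "7", "42"]
print(majority_vote(samples))
```

Sampling more candidates costs more inference compute but makes the vote less sensitive to any single faulty reasoning chain.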

Test-Time Scaling (TTS)

A paradigm where additional computation is allocated at inference time — through search, sampling, or verification — to improve output quality without retraining the model.

Token Merging

An inference acceleration technique that identifies and combines similar tokens in the transformer, reducing computational cost without retraining.

Token Pruning

A technique to reduce computational cost by selectively removing redundant visual tokens from the input sequence before processing by the language model.

Token Pruning / Compression

Techniques that reduce the number of visual tokens processed by the language model by removing redundant patches (typically background regions), achieving 50-89% compute reduction with minimal accuracy loss.

Topological Map/Graph

A graph-based spatial representation where nodes represent places or viewpoints and edges represent navigable connections, used for high-level navigation planning.

TPO (Trajectory Preference Optimization)

An extension of preference optimization applied to entire robot trajectories rather than individual tokens, used for fine-tuning VLA models.

Trajectory

A sequence of (observation, action, result) tuples representing an agent's complete interaction history while performing a task.

TriAtt-CoT

A multi-head attention mechanism for reward models introduced in SVIP that attends to three distinct dimensions of reasoning quality: Relevance, Logic, and Attribute.

TTFT (Time-to-First-Token)

The latency between providing input to a language model and receiving the first generated token, a key metric for interactive applications.

TTS (Test-Time Scaling)

The practice of allocating additional computational resources at inference time to improve output quality, such as generating multiple candidates, iterating through verification loops, or extending reasoning chains.

TTS (Text-to-Speech)

Technology that converts written text into spoken audio. End-to-end omni-modal models aim to replace external TTS modules with native speech generation.

Typography-Based Attack

A multimodal jailbreak technique that renders harmful text as an image (e.g., writing 'bomb' in a picture), bypassing text-based safety filters via the visual encoder.

UAR (Unweighted Average Recall)

A classification metric that averages recall across all classes equally regardless of class frequency, commonly used in emotion recognition to handle class imbalance.
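
A minimal sketch of the computation, using made-up emotion labels, shows how UAR differs from plain accuracy on imbalanced data. The function name is an assumption for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class.
labels = ["neutral"] * 8 + ["angry"] * 2
predictions = ["neutral"] * 10
uar = unweighted_average_recall(labels, predictions)
```

In this example plain accuracy is 0.8, while UAR is 0.5 because the minority class has zero recall.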

UDRL (Upside-Down Reinforcement Learning)

A reinforcement learning paradigm that conditions the model on desired outcomes as inputs rather than optimizing a reward function directly, enabling more controllable generation.

UHR (Ultra-High Resolution)

Satellite or aerial images with very fine spatial resolution (from sub-meter to a few meters per pixel), containing enormous pixel counts that challenge standard model input sizes.

Umwelt

A concept from cognitive science referring to an organism's subjective internal model of its environment, used in TIWM to describe a world model that captures task-relevant features rather than full scene fidelity.

Unified Multimodal Model (UMM)

A single neural network that can both understand (perceive, reason about) and generate (create) content across multiple modalities such as text, images, and audio, using shared parameters.

UniGRPO

A variant of GRPO adapted for discrete diffusion models that approximates sequence likelihoods via structured masking to enable policy-gradient optimization of complex rewards.

V*Bench

A benchmark for fine-grained visual search and spatial reasoning that requires models to actively explore complex scenes to find small or partially occluded target objects.

VAE (Variational Autoencoder)

A neural network that encodes images into a compact latent space and decodes them back, used in diffusion models to operate in a lower-dimensional space for efficiency.

VBench

A comprehensive benchmark for evaluating text-to-video generation across multiple dimensions including visual quality, temporal consistency, motion smoothness, dynamic degree, and text-video alignment.

VCoT / Visual CoT (Visual Chain-of-Thought)

An extension of CoT that incorporates visual artifacts (generated images, crops, latent tokens, or diagrams) as intermediate reasoning steps alongside text.

Video-LMM (Video Large Multimodal Model)

A model that integrates a visual encoder with a decoder-based large language model to understand and reason about video content across visual and textual modalities.

Video-MME

A benchmark for evaluating multi-modal video understanding across diverse question types including temporal, causal, and descriptive reasoning.

VideoAlign

A benchmark measuring how well generated videos align with human preferences across motion quality, temporal coherence, and visual aesthetics dimensions.

VideoQG (Video Question Generation)

The task of automatically generating relevant questions about video content, used for assessing and facilitating video comprehension.

VIE (Visual Information Extraction)

The task of extracting structured data (like JSON key-value pairs) from document images, combining OCR with semantic understanding of document layout.

Vision Transformer (ViT)

A neural network architecture that applies the Transformer's self-attention mechanism to image patches, treating each patch as a 'token' analogous to words in natural language processing.
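
The patch-as-token idea can be sketched by cutting a small grayscale image, represented as a nested list, into non-overlapping square patches and flattening each one. The learned linear projection and position embeddings that normally follow are omitted, and `patchify` is a hypothetical helper name.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 image with pixel values 0..15 becomes four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, 2)
```

Each flattened patch plays the role that a word embedding plays in a text transformer, and the patches are ordered row by row.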

Visual Chain-of-Thought (Visual CoT)

A reasoning paradigm where models interleave textual analysis steps with explicit visual operations — such as outputting bounding box coordinates or cropping image regions — to anchor reasoning in spatial evidence.

Visual Grounding

The ability of a model to connect its textual outputs to specific regions or objects in an image, often measured by whether bounding box predictions align with described entities.
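
Box alignment is commonly scored with intersection over union (IoU). The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates. Treating an IoU of at least 0.5 as a hit is a widespread convention rather than something stated above.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_grounded(pred_box, gt_box, threshold=0.5):
    """Assumed convention: a prediction counts as correct when IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```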

Visual Prompt Tuning (VPT)

A method that prepends learnable token vectors to the input sequence of a frozen vision transformer, allowing task adaptation without modifying the model's core weights.

ViT (Vision Transformer)

A transformer architecture applied directly to sequences of image patches for visual recognition, treating images as sequences of tokens.

VL-RewardBench

A benchmark for evaluating multimodal reward models on their ability to correctly rank preferred versus rejected vision-language responses.

VLA (Vision-Language-Action Model)

An extension of VLMs that additionally generates physical actions (robot trajectories, driving commands), bridging perception, reasoning, and motor control.

VLA (Vision-Language-Action)

A model architecture that unifies visual perception, language understanding, and physical action generation in a single framework, enabling end-to-end control from sensory inputs.

VLA (Vision-Language-Action) Model

A multimodal model that processes visual observations and language instructions to generate robotic actions, bridging perception and physical manipulation.

VLA (Vision-Language-Action) Models

Models that integrate visual perception and language understanding to generate physical robot actions, bridging multimodal reasoning with embodied control.

VLM (Vision-Language Model)

A model that jointly processes visual and textual inputs to perform tasks like image captioning, visual question answering, or cross-modal retrieval.

VLN (Vision-Language Navigation)

A task where an agent must navigate an environment by following natural language instructions, requiring joint understanding of visual scenes and language.

VLN-CE (VLN in Continuous Environments)

A more challenging variant of VLN where the agent operates in continuous 3D space rather than on pre-defined navigation graphs with discrete waypoints.

VLP (Vision-Language Pretraining)

The process of pretraining models on paired visual and textual data to learn cross-modal representations before fine-tuning on specific downstream tasks.

VPQ (Video Panoptic Quality)

A metric for evaluating panoptic segmentation consistency across video frames, measuring both recognition quality and temporal tracking accuracy.

VQ (Vector Quantization)

A compression method that maps continuous weight vectors to entries in a learned codebook, enabling extreme compression (e.g., 2-bit) by storing only codebook indices.
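
A minimal sketch of the lookup step, assuming a codebook that has already been learned. Each input vector is replaced by the index of its nearest codebook entry under squared Euclidean distance. The codebook values below are made up, and with four entries each stored index fits in 2 bits.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
        for vec in vectors
    ]


def dequantize(indices, codebook):
    """Reconstruct approximate vectors from stored indices."""
    return [codebook[i] for i in indices]


# Hypothetical 4-entry codebook over 2-dimensional weight vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.1]]
indices = quantize(weights, codebook)
approx = dequantize(indices, codebook)
```

Only `indices` and the codebook need to be stored. The reconstruction error depends on how well the codebook covers the original vectors.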

VQ-GAN (Vector Quantized Generative Adversarial Network)

A generative model that encodes images into discrete tokens using a learned codebook, enabling transformer-based generation of visual content.

VQ-VAE (Vector Quantized Variational Autoencoder)

A generative model that learns discrete latent representations by mapping continuous data to a finite codebook of learned vectors, enabling tokenization of complex inputs like floor plans.

VQA (Visual Question Answering)

A task requiring models to answer natural language questions about visual inputs (images or videos) by integrating perception and reasoning.

VQAv2 (Visual Question Answering v2)

A benchmark requiring models to answer open-ended questions about images, designed to minimize language bias through balanced question pairs.

VSI-Bench

A benchmark for evaluating the visual-spatial intelligence of Video-LLMs, that is, how well they understand spatial layout and relationships from video input.

VTG (Video Temporal Grounding)

The task of identifying specific time intervals in a video that correspond to a textual query, including moment retrieval and highlight detection.

Wasserstein Gradient Flow

A mathematical framework describing how probability distributions evolve over time to minimize a given divergence in the space of probability measures, providing theoretical foundations for particle-based generative models.

WER (Word Error Rate)

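A metric for ASR performance: the number of word substitutions, deletions, and insertions needed to turn the system transcript into the reference, divided by the number of reference words. Lower is better, with state-of-the-art systems achieving 3-5% on standard benchmarks.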
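
WER is conventionally computed from the word-level edit distance between the reference and the system transcript. The sketch below assumes whitespace tokenization and no text normalization, both of which real evaluations usually handle more carefully.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, the ratio can exceed 1.0 when the transcript is much longer than the reference.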

WiFi CSI (Channel State Information)

Fine-grained wireless signal measurements that capture how WiFi signals are affected by the environment, enabling device-free human sensing through signal distortion patterns.

World Model

A learned internal model of environment dynamics that allows an agent to predict future states from actions, enabling planning via 'imagination' without real-world interaction.

WSI (Whole Slide Image)

A gigapixel-scale digital scan of a tissue specimen on a glass slide, used in computational pathology for diagnosis and research.

X-CoT (Cross-Modal Chain-of-Thought)

A reasoning strategy that decomposes multimodal tasks into a reasoning phase (understanding reference inputs) and a generation phase, producing intermediate text and image outputs before final results.

Zero-initialized Gating

A technique where adapter or prompt contributions are scaled by a gate initialized to zero, ensuring the adapted model initially behaves identically to the frozen base model and only gradually learns task-specific adjustments.
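
The effect of the zero starting point can be sketched with a scalar gate that scales the adapter's contribution before it is added to the frozen model's output. In practice the gate would be a learnable parameter updated by the optimizer. Here it is a plain float, and all names are hypothetical.

```python
def gated_output(base_out, adapter_out, gate=0.0):
    """Combine frozen-model output with a gated adapter contribution."""
    return [b + gate * a for b, a in zip(base_out, adapter_out)]


base = [0.2, -1.0, 3.5]
adapter = [5.0, 5.0, 5.0]

at_init = gated_output(base, adapter)              # gate = 0: adapter has no effect
later = gated_output(base, adapter, gate=0.1)      # after training moves the gate
```

Starting from the base model's exact behavior avoids an abrupt disturbance from randomly initialized adapter weights early in fine-tuning.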

Zero-Shot Classification

The ability to classify inputs into categories never seen during training, typically by leveraging learned cross-modal representations to match inputs with text descriptions of novel classes.
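
A minimal sketch of the matching step, assuming an image embedding and one text embedding per class description already exist in a shared space. The three-dimensional vectors below are toy values, not outputs of any real encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class description whose embedding is most similar to the image's."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine_similarity(image_embedding, class_text_embeddings[name]),
    )


# Toy embeddings standing in for encoder outputs.
classes = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
prediction = zero_shot_classify([0.9, 0.1, 0.0], classes)
```

Adding a new class requires only a new text description and its embedding, with no retraining of the classifier.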

Zero-shot Transfer

The ability of a model to perform a task it was never explicitly trained on, relying on knowledge learned during pretraining to generalize to new domains.

Method	Key Innovation	Improves On	Papers
RL-Based Multimodal Reasoning	Use outcome-based RL rewards to incentivize step-by-step reasoning in multimodal models, bypassing the need for annotated reasoning traces.	Improves on standard SFT by +22.1% accuracy on ScreenSpot (UI-R1) and matches 32B model performance with 8B parameters (ContextRL); MobileRL-9B achieves 80.2% on AndroidWorld vs 64.2% prior SOTA.	Kimi k1.5 (2025), SophiaVL-R1 (2025), MobileRL (2025), ContextRL (2026)
Multimodal Safety Attack & Defense	Exploit the vision-language connector as a weak point in safety alignment by encoding harmful content in images, audio, or typographic text.	VoiceJailbreak increases GPT-4o attack success rate from 0.033 to 0.778; Typography-based attacks raise ASR on LLaVA by 30%+ over text-only baselines; BadMerging achieves >90% ASR where prior methods fail at <20%.	Voice Jailbreak Attacks Against GPT-4o (2024), MM-SafetyBench (2023), BadMerging (2024), OmniSafeBench-MM (2025)
Native Multimodal Pre-training	Jointly acquire visual and linguistic capabilities during a single pre-training stage rather than retrofitting a text-only LLM with a vision encoder.	InternVL3-78B achieves 72.2 on MMMU, setting SOTA for open-source MLLMs; Gemini Ultra scores 90.04% on MMLU, first to exceed human-expert performance (89.8%); Kimi K2.5 achieves 86.4% on GPQA-Diamond.	Gemini (2023), InternVL3 (2025), Kimi K2.5 (2026), ShareGPT4V (2023)
Audio-Language Reasoning Models	Apply structured reasoning frameworks (CoT, GRPO) to audio inputs, training models to plan, caption, and reason before answering complex audio questions.	SARI achieves 67.08% on MMAU, +16.35% over Qwen2-Audio base; Omni-R1 reaches 71.3% MMAU SOTA; Audio Flamingo Sound-CoT achieves 79.83% on MMAU-Sound vs GPT-4o Audio at 63.20%.	SARI (2025), Audio Flamingo 2 (2025), Audio Flamingo Sound-CoT Technical Report (2025), AHELM (2025)
Universal Medical Multimodal Models	Fine-tune or adapt general-purpose foundation models on large-scale curated medical datasets to enable cross-modality and cross-task generalization with prompt-based interfaces.	MedSAM outperforms specialist U-Net by 15.5% on unseen nasopharynx cancer segmentation; PRISM fine-tuned on 10% data outperforms supervised baselines using 100% data; MedRAX surpasses GPT-4o alone on complex clinical reasoning.	Segment Anything in Medical Images (2023), PRISM (2024), MedRAX (2025), PathMem (2026)

Benchmark	Metric	Best Result	Paper
MMLU	Accuracy (%)	90.04%	Gemini (2023)
MMMU	Accuracy (%)	72.2%	InternVL3 (2025)
AndroidWorld	Success Rate (%)	80.2%	MobileRL (2025)
MMAU (Multimodal Audio Understanding)	Accuracy (%)	71.3%	Omni-R1 (2025)
MathVista	Accuracy (%)	71.3%	SophiaVL-R1 (2025)