What Are Multi-Modal LLMs?
Multi-Modal LLMs research develops AI systems that jointly perceive, reason about, and generate content across vision, language, audio, and embodied action modalities.
Why It Matters
Bridging the gap between human-like multimodal cognition and current AI is essential for trustworthy visual assistants, autonomous systems, creative tools, and scientific discovery.
Key Paradigms
- Enabling models to answer questions about images, locate objects from text descriptions, read documents, and generate image captions by integrating visual perception with language reasoning
- Synthesizing high-fidelity images, videos, and edits from text descriptions using diffusion models, flow matching, and autoregressive transformers aligned with human preferences via reinforcement learning
- Processing temporal dynamics in video through question answering, temporal grounding, and causal reasoning with memory-augmented architectures and tool-augmented agents
- Enhancing logical inference over visual inputs while reducing hallucinations through chain-of-thought reasoning, process reward models, and preference-based alignment
- Designing efficient visual encoders, compressing tokens, and optimizing multimodal pretraining to enable deployment on resource-constrained devices without sacrificing accuracy
- Unifying perception, language, and physical action in vision-language-action models for robotic manipulation, autonomous driving, and world simulation
Field Evolution Timeline
Establishing multimodal architectures, early reasoning paradigms, and foundational benchmarks
- LLaVA established the visual instruction tuning paradigm connecting CLIP to LLMs (Visual Instruction Tuning, 2023)
- Multimodal Chain-of-Thought pioneered two-stage reasoning for VQA, achieving 85.31% on ScienceQA with a sub-1B model
- ViperGPT created the visual programming paradigm translating queries to executable Python code
- MMMU benchmark revealed a 33-point gap between GPT-4V and human experts on college-level multimodal questions
- CM3Leon proved autoregressive models can rival diffusion for image generation with 5x less compute
- AnimateDiff introduced plug-and-play motion modules to animate personalized text-to-image models
Large-scale preference alignment, benchmark proliferation, efficient architectures, and zero-shot personalization
- InstantID enabled plug-and-play face personalization using face recognition embeddings without fine-tuning
- PaLI-X jointly scaled vision (ViT-22B) and language (32B) components to achieve 86.0 on VQAv2
- Video-MME established the comprehensive benchmark for video LLM evaluation across all durations
- T2V-Turbo-v2 achieved 85.13 VBench score, surpassing commercial systems Gen-3 and Kling
- Preference optimization with just 5K samples was shown to reverse language degradation from visual fine-tuning
- Flow matching emerged as the dominant action generation paradigm for robotic VLA models (π₀)
GRPO-based reinforcement learning transforms all sub-fields from understanding to generation to robotics
- R1-Zero replicated emergent reasoning ('aha moment') in multimodal models via GRPO without supervised fine-tuning
- Flow-GRPO boosted SD3.5-M GenEval from 63% to 95%, spawning 30+ GRPO variants for visual generation
- Visual-RFT extended R1-style RL to visual tasks, improving COCO detection mAP from 9.8 to 31.3
- SimpleVLA-RL achieved 91.7% on LIBERO-Long from a single demonstration via GRPO adaptation
- RL-100 achieved 100% success across 1000 real-world robot evaluations and 7-hour continuous operation
- olmOCR made large-scale PDF processing economically viable at $176 per million pages via distillation
Sub-1B deployment models, agentic self-improvement, spatial intelligence, and automated content production
- GLM-OCR ranked first on OmniDocBench v1.5 with a 0.9B model using multi-token prediction
- VLA-Thinker introduced perception as a dynamically invocable reasoning action, tripling long-horizon success
- World2Mind achieved +17.6% on VSI-Bench via training-free allocentric spatial reasoning
- AdaReasoner achieved 97.6% on spatial planning via RL-trained adaptive tool orchestration
- RubiCap's 7B model outperformed GPT-4V on dense captioning via rubric-guided RL
Vision-Language Understanding
What: Research on models that jointly process visual and textual information to understand images, videos, and multimodal content for reasoning, generation, and decision-making.
Why: Bridging vision and language enables AI systems to perceive, reason about, and act upon the visual world using natural language instructions.
Baseline: Early Vision-Language Models (VLMs) take an image encoder such as CLIP, which aligns image-text pairs via contrastive learning, and decode with a frozen Large Language Model using fixed-resolution visual tokens.
- VLMs hallucinate content not present in images, prioritizing language priors over visual evidence
- Fixed-resolution processing destroys fine-grained detail and prevents unified image-video understanding
- Text-only Chain-of-Thought reasoning cannot actively inspect visual details like small objects or specific video frames
Running Example
Baseline: A standard VLM resizes the high-resolution menu image to 336×336 pixels, destroying fine text. It hallucinates plausible-sounding prices from language priors rather than reading the actual numbers, producing an incorrect total.
Challenge: This example illustrates three key challenges: (1) fixed-resolution processing loses critical text detail; (2) the model hallucinates prices it cannot read; (3) pure text reasoning cannot zoom into specific regions to verify numbers.
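To make the first challenge concrete, the sketch below estimates how tall a line of menu text ends up after a fixed 336×336 resize. The source photo size (3024×4032) and the 48-pixel digit height are illustrative assumptions, not figures from any cited paper.

```python
def text_height_after_resize(src_height, glyph_px, target=336):
    """Return the glyph height in pixels after squashing the image to target x target."""
    # A fixed-resolution encoder rescales the vertical axis by target / src_height,
    # so every glyph shrinks by the same factor.
    scale_y = target / src_height
    return glyph_px * scale_y

# Assumed example: a 3024x4032 phone photo whose price digits are 48 px tall.
resized = text_height_after_resize(src_height=4032, glyph_px=48)
print(f"glyph height after resize: {resized:.1f} px")
```

Under these assumptions the digits end up only a few pixels tall, which is consistent with the failure mode described above: the model cannot read the prices and falls back on language priors.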
Overall Progress
The field has undergone three major paradigm shifts: from fixed-resolution contrastive models (CLIP era) to dynamic-resolution architectures (Qwen2-VL, NVILA), from supervised fine-tuning to reinforcement learning-based reasoning (ThinkLite-VL, Pixel Reasoner), and from passive perception to active agentic behavior with tool use and self-evolution (OpenThinkIMG, MM-Zero). Joint visual-textual reasoning now rivals or surpasses human-level performance on specific benchmarks, while 7B-parameter models routinely outperform GPT-4o on targeted tasks.
Sub-topics
VLM Architecture & Efficient Training
180 papers
Core model architectures for vision-language understanding, including dynamic resolution handling, efficient token compression, and novel training paradigms like diffusion-based VLMs and data-efficient curation.
Visual Reasoning & Chain-of-Thought
150 papers
Methods that enhance VLMs' reasoning capabilities through reinforcement learning, visual chain-of-thought, spatial reasoning, and mathematical problem solving with active visual inspection.
Embodied AI & Robotics
130 papers
Vision-Language-Action (VLA) models for robotic manipulation, autonomous driving, and embodied navigation, leveraging VLMs for task planning, spatial understanding, and closed-loop control.
Evaluation & Benchmarks
120 papers
Benchmarks and evaluation frameworks that systematically assess VLM capabilities, expose failure modes like data leakage and position bias, and measure progress in spatial reasoning, safety, and domain-specific tasks.
Safety, Alignment & Hallucination Mitigation
100 papers
Research addressing VLM reliability, including hallucination detection and mitigation, adversarial robustness, reward modeling for human alignment, and safety against jailbreak attacks.
Video Understanding & Temporal Reasoning
71 papers
Long video comprehension, streaming video reasoning, and temporal understanding using VLMs, addressing challenges of context length, frame selection, and temporal consistency.
Key Insights
- Reinforcement learning post-training enables 7B models to surpass GPT-4o on visual reasoning benchmarks.
- Fixed-resolution encoding destroys critical visual detail; dynamic resolution yields 30%+ accuracy gains on text-heavy tasks.
- Most VLM benchmarks contain samples solvable without images, inflating reported capabilities.
- Pixel-space reasoning with active visual tools outperforms text-only chain-of-thought on fine-grained perception.
- Data curation with 5% of training data can match or exceed full-dataset performance when guided by capability analysis.
Timeline
Research has evolved from building foundational architectures (2023) through rigorous evaluation and efficiency optimization (2024) to RL-driven reasoning breakthroughs and agentic self-improvement (2025-2026), with an accelerating trend toward embodied applications and zero-data self-evolution.
- POPE (Evaluating Object Hallucination in Large..., 2023) established the foundational hallucination evaluation benchmark, revealing that LVLMs answer 'yes' to 99% of object queries
- (VoxPoser, 2023) pioneered LLM-synthesized 3D value maps for zero-shot robotic manipulation
- (MMBench, 2023) introduced circular evaluation and LLM-based choice extraction for robust VLM assessment
- (Beyond Hallucinations, 2023) reframed hallucination elimination as a preference optimization task
- MMStar (Are We on the Right..., 2024) revealed that models like GeminiPro outperform random choice by 24% without accessing any visual input, exposing severe data leakage in benchmarks
- (Qwen2-VL, 2024) introduced Naive Dynamic Resolution with M-RoPE, achieving 93.8% on DocVQA and surpassing GPT-4o on MathVista by 6.7%
- (NVILA, 2024) demonstrated scale-then-compress architecture reducing training costs by 1.9-5.1x while matching leading open VLMs
- (LongVILA, 2024) scaled long-context VLMs to handle long videos through distributed training innovations
- (VisionArena, 2024) collected 230K real-world user-VLM conversations with preference labels for human-aligned evaluation
Shift from fixed-resolution visual encoding to dynamic, native-resolution processing (Qwen2-VL, NVILA), enabling unified image-video understanding.
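As a rough illustration of why native-resolution processing changes the token budget, the sketch below counts visual tokens for an image split into patches that are then merged in small groups. The 14-pixel patch size, the 2×2 merge, and the round-up policy are assumptions chosen for the example; real models also add special tokens and apply their own resizing rules.

```python
import math

def visual_tokens(width, height, patch=14, merge=2):
    """Approximate visual token count for an image encoded at its native size."""
    cell = patch * merge  # pixels covered by one merged token along each axis
    # Assumption: each dimension is rounded up to a whole number of cells.
    return math.ceil(width / cell) * math.ceil(height / cell)

fixed = visual_tokens(336, 336)      # fixed-resolution input
native = visual_tokens(1260, 1680)   # hypothetical document scan at native size
print(fixed, native)
```

The native-resolution scan receives many more tokens than the fixed-size input, which preserves fine detail but also motivates the token compression methods discussed in this section.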
- ThinkLite-VL (SoTA with Less, 2025) achieved 75.1% on MathVista with a 7B model using only 11k samples via MCTS-guided reinforcement fine-tuning, surpassing GPT-4o (63.8%)
- (Pixel Reasoner, 2025) introduced pixel-space reasoning with curiosity-driven RL, outperforming Gemini-2.5-Pro on V* Bench (84.3% vs 79.2%)
- CADC (Capability-Attributed Data Curation, 2025) surpassed full-data training using only 5% of data by analyzing intrinsic model capabilities
- (EditReward, 2025) achieved 65.72% accuracy on GenAI-Bench, outperforming GPT-5 (59.61%) for image editing reward modeling
- (Safety at Scale, 2025) provided the first unified safety taxonomy across modalities, analyzing 574 papers
Reinforcement learning emerged as the dominant post-training paradigm, enabling VLMs to reason in pixel space, use visual tools, and self-improve with minimal data.
- (MM-Zero, 2026) demonstrated self-evolving VLMs from zero data using a tri-role Proposer-Coder-Solver framework
- SGCoT (Can VLMs Solve the Shell Game?, 2026) exposed that frontier VLMs perform at random chance on entity tracking and introduced Spatiotemporal Grounded Chain-of-Thought achieving >90% accuracy
- (DatBench, 2026) achieved 13x evaluation speedup while revealing 35-point accuracy drops when converting from multiple-choice to generative evaluation
- (Meissa, 2026) matched proprietary frontier agents in 10 of 16 medical settings while being 22x faster and using 25x fewer parameters
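The circular evaluation idea mentioned in the timeline (MMBench) can be sketched as follows: the same multiple-choice question is asked once per rotation of the options, and it counts as solved only if every pass is correct. The `predict` callable is a hypothetical stand-in for a VLM that returns the index of the option it selects.

```python
def circular_eval(choices, answer_idx, predict):
    """Return True only if `predict` picks the right option under every rotation."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # After rotating left by `shift`, the correct option sits at this index.
        target = (answer_idx - shift) % n
        if predict(rotated) != target:
            return False
    return True

# A position-biased "model" that always picks the first option.
always_first = lambda opts: 0
# A "model" that locates the correct content wherever it appears.
knows_answer = lambda opts: opts.index("cat")

options = ["dog", "cat", "bird", "fish"]
biased_ok = circular_eval(options, 1, always_first)
robust_ok = circular_eval(options, 1, knows_answer)
print(biased_ok, robust_ok)
```

A model that relies on option position passes at most one rotation, so it is rejected, while a model that tracks the content passes all of them.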
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Dynamic Resolution VLM Architectures | Replace fixed-resolution visual encoding with dynamic patch sequences and multimodal positional embeddings (e.g., M-RoPE) that decompose into spatial and temporal dimensions. | Improves on fixed-resolution baselines by +6.7% on MathVista (Qwen2-VL-72B vs GPT-4o) and achieves 93.8% on DocVQA, setting new state-of-the-art for document understanding. | Qwen2-VL (2025), NVILA (2024), Pixtral 12B (2024), What matters when building vision-language... (2024) |
| Reinforcement Learning for Visual Reasoning | Use reinforcement learning with verifiable rewards to train VLMs to reason step-by-step, combining text-based chain-of-thought with active visual inspection tools. | ThinkLite-VL-7B achieves 75.1% on MathVista, surpassing GPT-4o (63.8%) and Qwen2.5-VL-72B (71.9%) using only 11k training samples. | SoTA with Less (2025), Pixel Reasoner (2025), OpenThinkIMG (2025), DualMindVLM (2025) |
| Vision-Language-Action Models for Embodied AI | Bridge perception and action by using VLMs as high-level planners that output structured action representations (3D value maps, meta-actions, waypoints) for low-level controllers. | VoxPoser enables zero-shot robotic manipulation from language commands; SimLingo achieves state-of-the-art closed-loop autonomous driving with language-action alignment. | VoxPoser (2023), SimLingo (2025), Poutine (2025), Interactive Post-Training for Vision-Language-Action Models (2025) |
| Hallucination Mitigation & Human Alignment | Train models to prefer visually grounded outputs over plausible-sounding fabrications using preference pairs, or identify and suppress attention heads that copy incorrect prompt information. | HA-DPO improves MiniGPT-4 POPE accuracy from 51.13% to 86.13% (+35 points); POVID reduces CHAIR hallucination score from 66.8 to 31.8 on LLaVA-1.5. | Evaluating Object Hallucination in Large... (2023), Beyond Hallucinations (2023), Aligning Modalities in Vision Large... (2024), Mechanisms of Prompt-Induced Hallucination in... (2026) |
| Visual Token Compression & Efficient Inference | Select the most informative visual tokens using object-centric attention, optimal transport, or context-aware resolution prediction, then discard redundant tokens before language model processing. | OC-VTP retains only 11.1% of visual tokens on LLaVA-1.5 while maintaining 95.5% performance, achieving 17x reduction in prefill FLOPs. | OC-VTP (2025), PACT (2025), VLM-Pruner (2025), InfiniteVL (2025) |
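The token compression row can be illustrated with a generic score-based pruning sketch: rank visual tokens by an importance score and keep only the top fraction. This is not OC-VTP's object-centric criterion; the random scores are placeholders, and the 576-token input is an assumption matching a 24×24 patch grid.

```python
import random

def prune_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering so positional information stays meaningful.
    return sorted(ranked[:k])

random.seed(0)
scores = [random.random() for _ in range(576)]  # placeholder importance scores
kept = prune_tokens(scores, keep_ratio=0.111)
print(len(kept), f"{len(kept) / len(scores):.1%}")
```

Only the kept indices would be passed on to the language model, which is where the prefill savings come from.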
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MathVista | Accuracy (%) | 79.7% | SoTA with Less (2025) |
| DocVQA | Accuracy (%) | 93.8% | Qwen2-VL (2025) |
| V* Bench | Accuracy (%) | 84.3% | Pixel Reasoner (2025) |
| POPE (Adversarial) | Accuracy (%) | 86.13% (HA-DPO on MiniGPT-4) | Beyond Hallucinations (2023) |
| MMStar | Accuracy (%) | 57.1% | Are We on the Right... (2024) |
Known Limitations (4)
- VLMs systematically hallucinate by prioritizing language priors over visual evidence, especially for high-count objects and specific prompt phrasings, undermining reliability in safety-critical applications. (affects: Dynamic Resolution VLM Architectures, Reinforcement Learning for Visual Reasoning)
Potential fix: Identify and ablate specific prompt-copying attention heads (PIH-heads); use preference optimization with synthetic negative examples (HA-DPO, POVID); apply contrastive decoding to suppress outlier token attention (DAMRO).
- VLMs lack genuine spatial reasoning and object permanence, performing near random chance on entity tracking tasks when visual shortcuts are removed. (affects: Dynamic Resolution VLM Architectures, Vision-Language-Action Models for Embodied AI)
Potential fix: Spatiotemporal Grounded Chain-of-Thought (SGCoT) forces explicit coordinate tracking; blueprint-based spatial reasoning constructs structured representations before answering; external 3D tools (depth estimation, point clouds) augment visual understanding.
- Benchmark evaluations are often unreliable due to data leakage, multiple-choice format inflation, and text-solvable questions, making it difficult to measure true VLM progress. (affects: Dynamic Resolution VLM Architectures, Reinforcement Learning for Visual Reasoning)
Potential fix: Use generative evaluation instead of multiple-choice (DatBench); filter benchmarks for visual necessity (MMStar); employ circular evaluation to detect position bias (MMBench); collect real-world user preference data (VisionArena).
- Visual token processing accounts for 95-99% of compute in VLMs, creating a severe efficiency bottleneck for high-resolution and long-video inputs that limits deployment on edge devices. (affects: Dynamic Resolution VLM Architectures, Visual Token Compression & Efficient Inference)
Potential fix: Object-centric token pruning (OC-VTP) retains 11% of tokens at 95.5% performance; context-aware resolution selection (CARES) reduces tokens by 70-80%; hybrid linear-attention architectures (InfiniteVL) achieve constant-memory streaming inference.
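Several of the potential fixes above rely on preference optimization. As a rough illustration, the standard Direct Preference Optimization (DPO) loss for a single preference pair can be written as below. HA-DPO and POVID build on objectives of this kind; their data construction and any extra terms are not reproduced here, and the log-probabilities are made-up inputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # Margin: how much more the policy prefers the grounded answer over the
    # hallucinated one, measured relative to the frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); use -z when exp would overflow.
    if z < -30:
        return -z
    return math.log1p(math.exp(-z))

# Made-up sequence log-probabilities for a grounded (chosen) and a
# hallucinated (rejected) caption.
before = dpo_loss(-12.0, -11.0, ref_logp_chosen=-12.0, ref_logp_rejected=-11.0)
after = dpo_loss(-10.0, -14.0, ref_logp_chosen=-12.0, ref_logp_rejected=-11.0)
print(round(before, 4), round(after, 4))
```

The loss falls as the policy shifts probability toward the visually grounded response and away from the fabricated one.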
Major papers in this topic (10)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2025-05)
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement (2025-04)
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning (2025-05)
- NVILA: Efficient Visual Language Models from Pre-training to Deployment (2024-12)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (2023-07)
- Are We on the Right Way for Evaluating Large Vision-Language Models? (2024-04)
- CADC: Capability-Attributed Data Curation (2025-10)
- EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing (2025-09)
- Can Vision-Language Models Solve the Shell Game? (2026-03)
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (2025-05)
Diving deeper into Vision-Language Understanding, let's examine specific research threads that define this area.
Visual Question Answering
What: Visual Question Answering (VQA) requires models to answer natural language questions about visual inputs by integrating perception, reasoning, and external knowledge.
Why: Enabling machines to understand and reason about visual content is critical for accessibility, autonomous systems, medical diagnosis, and human-AI interaction.
Baseline: Standard Vision-Language Models encode images globally with a frozen vision encoder and generate answers directly via a language model without structured reasoning.
- Complex multi-step reasoning requiring compositional logic, spatial understanding, and domain-specific knowledge integration
- Hallucination and robustness failures where models generate plausible but incorrect answers based on language priors rather than visual evidence
- Scaling to high-resolution, multi-image, and long-video inputs while maintaining computational efficiency on resource-constrained devices
Running Example
Baseline: A standard VLM processes the receipt at low resolution, missing small text. It may hallucinate a plausible total based on common receipt patterns rather than reading the actual numbers, or fail to identify which items are circled.
Challenge: This example requires OCR (reading small text), spatial reasoning (identifying circled items), and arithmetic (summing prices): a composition of perception, localization, and multi-step logic that baseline models handle poorly.
Overall Progress
Visual Question Answering has moved through three major paradigms: end-to-end neural models (2017-2022), compositional program execution and structured reasoning (2023-2024), and most recently reinforcement learning-driven emergent reasoning without human supervision (2025-2026). The field has simultaneously expanded from generic VQA to expert-level domain-specific applications, while benchmark difficulty has escalated dramatically: top models still trail humans by 20-30 percentage points on challenging benchmarks like MMMU and MME-RealWorld.
Sub-topics
Chain-of-Thought and Multi-Step Reasoning
30 papers
Methods that decompose visual question answering into structured, multi-step reasoning processes, including Chain-of-Thought prompting, stage-wise generation, and visual grounding of intermediate reasoning steps.
Reinforcement Learning for Visual Reasoning
22 papers
Approaches using reinforcement learning, particularly Group Relative Policy Optimization (GRPO) and verifiable rewards, to elicit reasoning capabilities in VLMs without requiring human-annotated reasoning traces.
Domain-Specific Visual Question Answering
45 papers
Adapting VQA to specialized domains including medical imaging, autonomous driving, remote sensing, and agriculture, where general-purpose models lack domain knowledge and fine-grained perception.
VQA Benchmarks and Model Evaluation
38 papers
Creation of comprehensive benchmarks for evaluating VLMs across diverse dimensions including expert-level reasoning, cultural understanding, spatial awareness, factuality, and robustness to adversarial inputs.
Efficient Architectures and Training Recipes
30 papers
Innovations in VLM architecture design (including high-resolution processing, Mixture-of-Experts, hybrid Mamba-Transformer models, and data-centric training strategies) to improve capability and efficiency.
Visual Programming and Tool-Augmented VQA
12 papers
Methods that translate visual queries into executable programs or invoke specialized tools, leveraging code LLMs as reasoning engines and pre-trained vision models as perception modules.
Robustness, Safety, and Hallucination Mitigation
17 papers
Research on understanding and mitigating VLM failures including adversarial vulnerability, visual hallucinations, sycophancy, spurious biases, and developing safety guardrails for multimodal content.
Key Insights
- Reinforcement learning elicits emergent visual reasoning without human-annotated traces
- Even top models trail human experts by 20-30 percentage points on expert-level benchmarks
- Small 2-3B models can outperform 70B+ models when trained with RL-based reasoning
- Visual programming via code generation enables zero-shot compositional reasoning
- Chain-of-Thought prompting sometimes degrades performance on spatial and visual tasks
Timeline
Research has evolved from monolithic predict-the-answer models toward modular, reasoning-aware systems that decompose perception from logic. The dominant trend in 2025-2026 is replacing supervised fine-tuning with reinforcement learning (GRPO) to elicit reasoning, alongside increasing specialization for high-stakes domains like medicine and autonomous driving.
- (MemexQA, 2017) introduced personal photo collection QA, treating VQA as retrieval-then-inference over dynamic multimodal collections
- (Multimodal Chain-of-Thought Reasoning, 2023) pioneered two-stage rationale generation for visual reasoning, achieving 85.31% on ScienceQA with a sub-1B model
- ViperGPT (Visual Inference via Python Execution, 2023) established the visual programming paradigm, translating queries to executable Python code with pre-trained vision tools
- MIMIC-IT (Multi-Modal In-Context Instruction Tuning, 2023) introduced multi-modal in-context examples at scale, creating the Otter model with highest human-evaluated Elo rating
- MMMU (Massive Multi-discipline Multimodal Understanding, 2023) created the definitive expert-level benchmark with 11.5K questions across 30 subjects, where GPT-4V achieves only 55.7% vs human 88.6%
- MathVista (Evaluating Mathematical Reasoning, 2023) unified 28 visual-math datasets, revealing GPT-4V trails humans by 10.4 points at 49.9% accuracy
- CogAgent (A Visual Language Model for..., 2023) introduced a high-resolution cross-module for GUI understanding, achieving SOTA on AITW with >50% FLOPs reduction
- Vision-Flan (Scaling Human-Labeled Tasks, 2024) demonstrated that diverse human-labeled task training followed by minimal GPT-4 alignment achieves superior generalization
- SpatialVLM (Endowing VLMs with Spatial Reasoning, 2024) generated 2 billion synthetic 3D spatial VQA pairs, enabling VLMs to estimate metric distances where GPT-4V produces valid numbers only 1% of the time
Shift from simple captioning-based VQA to expert-level multimodal reasoning benchmarks (MMMU, MathVista) that exposed massive gaps between model and human performance, driving the field toward structured reasoning.
- LLaVA-CoT (Let Vision Language Models Reason Step-by-Step, 2024) introduced stage-wise retracing search, surpassing GPT-4o-mini on multimodal reasoning benchmarks
- R1-Zero (Visual Reasoning on a 2B..., 2025) first replicated DeepSeek R1's 'aha moment' in multimodal setting, applying GRPO directly to base VLMs without SFT
- MedVLM-R1 (Medical VLM via Reinforcement Learning, 2025) boosted medical VQA accuracy from 55.11% to 78.22% using only 600 samples and GRPO, outperforming 72B models
- MME-RealWorld (A Benchmark for MLLM in..., 2025) created a high-resolution human-annotated benchmark where state-of-the-art models fail to surpass 60% accuracy
- MM1.5 (Data-Centric, 2024) provided the definitive data-centric training recipe, achieving 91.0 on DocVQA surpassing GPT-4V
Emergence of RL-based reasoning (GRPO/RLVR) as a replacement for supervised fine-tuning, enabling models to develop emergent reasoning without human-annotated traces (the 'aha moment' paradigm).
- TTRV (Test-Time Reinforcement Learning for Vision Language Models, 2025) pioneered RL at inference time on unlabeled data, improving ImageNet-Sketch by +52.4% and outperforming GPT-4o on classification
- Hulu-Med (Transparent Generalist Medical VLM, 2025) unified text, 2D/3D images, and video understanding in a single medical architecture, surpassing GPT-4o on 16 of 30 benchmarks
- S-Chain (Structured Visual Chain-of-Thought, 2025) provided the first large-scale expert-annotated medical reasoning dataset with bounding-box grounded reasoning steps
- CRAG-MM (Comprehensive RAG Benchmark for Multi-modal Multi-turn, 2025) established realistic wearable VQA evaluation where even GPT-5 achieves only 63% accuracy with 31% hallucinations
- (Logic-Driven, 2026) introduced Logical Consistency Reward that penalizes reasoning drift, improving reasoning accuracy by +19.65%
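The visual programming paradigm in the timeline (ViperGPT) can be illustrated with a toy sketch applied to the receipt running example. Everything here is hypothetical: `SCENE` stands in for detections that pretrained vision models would produce, `find` is an invented perception primitive, and the program text is hand-written in place of code an LLM would generate.

```python
# Toy scene: a real system would obtain these detections from pretrained
# vision models; here they are hard-coded, hypothetical values.
SCENE = [
    {"label": "coffee", "price": 3.50, "circled": True},
    {"label": "bagel", "price": 2.25, "circled": False},
    {"label": "juice", "price": 4.00, "circled": True},
]

def find(predicate):
    """Hypothetical perception primitive: return detections matching a predicate."""
    return [item for item in SCENE if predicate(item)]

# Program text a code LLM might emit for the query
# "What is the total of the circled items?" (hand-written for this sketch).
GENERATED_PROGRAM = """
def execute_query():
    circled = find(lambda item: item["circled"])
    return sum(item["price"] for item in circled)
"""

namespace = {"find": find}
exec(GENERATED_PROGRAM, namespace)  # run the generated code against the small API
total = namespace["execute_query"]()
print(total)
```

The separation is the point of the design: perception lives behind a small API, while composition (filtering, then summing) is ordinary code that can be inspected and debugged.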
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Multimodal Chain-of-Thought Reasoning | Separate rationale generation from answer inference, injecting dense vision features into a two-stage framework that first explains, then concludes. | Improves on direct answer prediction by +3.68% on ScienceQA (85.31% vs 81.63% text-only CoT), and LLaVA-CoT surpasses GPT-4o-mini and Gemini-1.5-Pro on average across 6 benchmarks. | Multimodal Chain-of-Thought Reasoning in Language... (2023), LLaVA-CoT (2024), DDCoT (2023), Compositional Chain-of-Thought Prompting for Large... (2023) |
| Reinforcement Learning for Visual Reasoning | Bypass supervised fine-tuning entirely, applying RL with simple accuracy and format rewards directly on base VLMs to induce spontaneous multi-step reasoning. | VisualThinker-R1-Zero achieves 59.47% on CVBench, outperforming the SFT-tuned Qwen2-VL-2B by ~2% and the base model by ~30%. TTRV improves ImageNet-Sketch by +52.4% at test time. | R1-Zero's 'Aha Moment' in Visual... (2025), TTRV (2025), Med-R1 (2025), Game-RL (2025) |
| Visual Program Execution | Replace end-to-end neural inference with LLM-generated code that orchestrates pre-trained vision models via a simple API, enabling zero-shot compositional reasoning. | ViperGPT achieves 72.0% on RefCOCO zero-shot, outperforming GLIP by +17.0%, and surpasses the 80B Flamingo on OK-VQA (51.9%) despite being zero-shot. ProViQ improves ActivityNet-QA by +25% over prior zero-shot methods. | ViperGPT (2023), Visual Program Distillation (2023), Zero-Shot (2023) |
| Data-Centric Multimodal Instruction Tuning | Maximize task diversity and data quality through systematic curation, using multi-modal in-context examples and two-stage tuning to balance capability and alignment. | Vision-Flan achieves +3.1 on MM-Bench and +6.5 on MME over LLaVA-1.5 while maintaining 84.0% on catastrophic forgetting benchmarks vs 73.3%. MIMIC-IT's Otter model achieves highest Elo rating (1014.7) on Multi-Modality Arena. | MIMIC-IT (2023), Vision-Flan (2024), MM1.5 (2024) |
| High-Resolution Efficient VLM Architectures | Decouple high-resolution detail capture from the main model pathway using lightweight branches, token compression, or linear-complexity layers to scale visual context affordably. | CogAgent achieves SOTA on AITW and 9 VQA benchmarks while reducing FLOPs by >50% vs scaling standard models to 1120×1120. LLaVA-Phi (3B) outperforms 7B+ models on ScienceQA. LongLLaVA processes ~1000 images on a single A100. | CogAgent (2023), LLaVA-Phi (2024), LongLLaVA (2024), Kimi-VL (2025) |
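The reinforcement learning row describes simple accuracy and format rewards. A minimal sketch of how such rewards could feed GRPO's group-relative advantages is shown below. The `<think>`/`<answer>` tag format, the reward values, and the sample completions are assumptions for illustration; the policy-gradient update itself is omitted.

```python
import re
import statistics

def reward(completion, gold):
    """Rule-based reward: 1 point for the expected format, 1 more for a correct answer."""
    # Assumed output format: <think>...</think><answer>...</answer>
    pattern = r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*"
    match = re.fullmatch(pattern, completion, re.S)
    if match is None:
        return 0.0
    return 1.0 + (1.0 if match.group(1).strip() == gold else 0.0)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group = [
    "<think>3 + 4</think><answer>7</answer>",  # well-formed and correct
    "<think>3 + 5</think><answer>8</answer>",  # well-formed but wrong
    "The answer is 7",                         # missing the required tags
    "<think>count</think><answer>7</answer>",  # well-formed and correct
]
rewards = [reward(c, gold="7") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the group mean, no learned value model or human-annotated reasoning trace is needed: completions that beat their siblings are reinforced and the rest are discouraged.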
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MMMU (Massive Multi-discipline Multimodal Understanding) | Accuracy (%) | 64.0% | Kimi-VL (2025) |
| MathVista (Mathematical Reasoning in Visual Contexts) | Accuracy (%) | 80.1% | Kimi-VL (2025) |
| ScienceQA (Science Question Answering) | Accuracy (%) | 85.31% | Multimodal Chain-of-Thought Reasoning in Language... (2023) |
| OK-VQA (Outside Knowledge Visual Question Answering) | Accuracy (%) | 62.0% | Bootstrapping Large Language Models with... (2026) |
| MME-RealWorld (Real-World Multimodal Evaluation) | Accuracy (%) | <60% | MME-RealWorld (2025) |
Known Limitations (4)
- Hallucination and factual unreliability: VLMs frequently generate plausible but incorrect answers based on language priors rather than visual evidence, especially for rare entities or fine-grained details (affects: Multimodal Chain-of-Thought Reasoning, Data-Centric Multimodal Instruction Tuning)
Potential fix: Bottom-up reasoning with scene graph verification, contrastive self-training (VC-STaR), visual attention amplification, and Image-DPO training to penalize text-prior-based guessing
- Domain transfer gap: Models trained on internet-scale data struggle severely in specialized domains (medical, remote sensing, ancient documents) due to missing fine-grained perceptual and knowledge requirements (affects: Data-Centric Multimodal Instruction Tuning, High-Resolution Efficient VLM Architectures)
Potential fix: Domain-specific pre-training with curriculum strategies, specialized vision encoders (MedSigLIP), cross-spectral bridging (GRAFT), and GRPO-based domain adaptation requiring minimal annotated samples
- Adversarial vulnerability: VLMs are highly susceptible to adversarial visual perturbations that bypass safety alignment, and Chain-of-Thought reasoning provides only marginal robustness improvements (affects: Multimodal Chain-of-Thought Reasoning, High-Resolution Efficient VLM Architectures)
Potential fix: Adversarial pre-training at web scale (Δ-CLIP), double visual defense combining adversarial pre-training with adversarial instruction tuning, and ECSO's training-free image-to-text transformation for safety restoration
- Cultural and linguistic bias: Models are predominantly Western/English-centric, with massive performance gaps on non-Western cultural concepts and low-resource languages (up to 30+ percentage point drops) (affects: Data-Centric Multimodal Instruction Tuning, Reinforcement Learning for Visual Reasoning (GRPO/RLVR))
Potential fix: Native-language dataset construction (AMCrawl for Arabic), culturally sourced benchmarks (CulturalVQA, K-Viscuit), and scalable multilingual chart generation via code decoupling
Major papers in this topic (10)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023-11)
- ViperGPT: Visual Inference via Python Execution for Reasoning (2023-03)
- CogAgent: A Visual Language Model for GUI Agents (2023-12)
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning (2023-06)
- R1-Zero's 'Aha Moment' in Visual Reasoning on a 2B Non-SFT Model (2025-03)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (2023-10)
- Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding (2025-10)
- TTRV: Test-Time Reinforcement Learning for Vision Language Models (2025-10)
- Double Visual Defense: A Novel Adversarial Defense for Vision-Language Models (2025-02)
- S-Chain: Structured Visual Chain-of-Thought for Medicine (2025-10)
Within the same paradigm, another important research direction focuses on Visual Grounding and Object Detection.
Visual Grounding and Object Detection
What: Visual grounding connects natural language descriptions to specific regions, objects, or temporal segments in images, videos, and 3D scenes, enabling precise spatial localization.
Why: Reliable grounding is essential for embodied agents, GUI automation, medical diagnosis, and autonomous driving where imprecise localization leads to catastrophic failures.
Baseline: Standard approaches use independent visual and text encoders with cross-modal fusion decoders to regress bounding box coordinates from image-text pairs.
- Models often hallucinate objects or ignore visual evidence, relying on language priors instead of genuinely grounding predictions in image content
- Precise coordinate generation is brittle for language-centric architectures, especially with small objects, high-resolution screens, and cluttered scenes
- Extending grounding from 2D images to 3D scenes, temporal video segments, and interactive environments requires multi-step spatial reasoning
Running Example
Baseline: A standard VLM processes the entire image globally and predicts coordinates as text tokens (e.g., '[0.45, 0.62, 0.52, 0.70]'). It often selects the most salient mug in the scene rather than the one specifically behind the laptop, because the independent encoders lack text-conditioned visual attention.
Challenge: This example requires: (1) resolving spatial relations ('behind the laptop'), (2) distinguishing among multiple similar objects ('small red mug' vs. other mugs), and (3) attending to a small region in a cluttered scene where language priors may override visual evidence.
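To make the baseline's text-coordinate output concrete, the following minimal Python sketch parses a box string of the form quoted above and scores it with intersection-over-union (IoU). The function names, the assumption that coordinates are normalized `[x1, y1, x2, y2]` values, and the target box are illustrative and are not taken from any specific model.

```python
import re


def parse_box(text):
    """Parse '[x1, y1, x2, y2]' into a tuple of floats; return None if malformed."""
    match = re.fullmatch(r"\s*\[\s*([^\[\]]+)\s*\]\s*", text)
    if match is None:
        return None
    parts = match.group(1).split(",")
    if len(parts) != 4:
        return None
    try:
        x1, y1, x2, y2 = (float(p) for p in parts)
    except ValueError:
        return None
    # Reject boxes with zero or negative extent.
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = parse_box("[0.45, 0.62, 0.52, 0.70]")
target = (0.46, 0.63, 0.53, 0.71)  # hypothetical ground-truth box for the mug
is_hit = iou(predicted, target) >= 0.5  # the IoU@0.5 criterion used for RefCOCO accuracy
```

A single malformed or slightly shifted digit changes the parsed box, which is one way to see why text-based coordinate generation is described as brittle for small objects.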
Overall Progress
The field has undergone a paradigm shift from static supervised coordinate regression to dynamic RL-based policy optimization, where models actively search images and ground reasoning in visual evidence. Early work established contrastive pretraining and high-resolution architectures as foundations, while recent advances demonstrate that small RL-trained models (3-7B) can outperform much larger supervised models (72B+) on precision grounding tasks. The convergence of GUI grounding, 3D spatial reasoning, and medical/remote sensing applications shows grounding becoming a universal capability rather than a niche task.
Sub-topics
Reinforcement Learning for Visual Perception & Grounding
45 papers
Applies reinforcement learning with verifiable rewards (RLVR), such as IoU scores and format checks, to train VLMs for precise visual localization, replacing supervised fine-tuning with policy optimization that directly optimizes geometric metrics.
Grounded Visual Reasoning & Chain-of-Thought
40 papers
Methods that interleave textual reasoning steps with explicit visual references (bounding boxes, cropped regions, visual tokens) to anchor multi-step reasoning in spatial evidence rather than relying on text-only chains.
GUI & Screen Agent Grounding
25 papers
Focuses on precisely localizing UI elements (buttons, text fields, icons) in high-resolution screenshots for GUI automation, addressing challenges like visual clutter, tiny targets, and the mismatch between dense pixel coordinates and language tokens.
3D Scene Understanding & Visual Grounding
25 papers
Extends visual grounding from 2D images to 3D scenes, leveraging multi-view reasoning, point clouds, and Bird's Eye View representations to localize objects in physical environments for embodied AI.
Open-Vocabulary Detection & Segmentation
35 papers
Leverages vision-language pretraining (CLIP, SigLIP) to detect and segment arbitrary objects described in natural language, enabling zero-shot generalization beyond fixed training categories.
Temporal Video Grounding
15 papers
Localizes specific temporal segments in videos given natural language queries, combining visual understanding with temporal reasoning using RL-based optimization of timestamp predictions.
Domain-Specific Grounding (Medical, Remote Sensing, Robotics)
40 papers
Adapts visual grounding to specialized domains requiring domain knowledge, including medical image grounding for radiology, satellite imagery analysis, and robotic manipulation with physical reasoning.
Hallucination Mitigation & Visual Grounding Reliability
24 papers
Addresses the fundamental reliability challenge where VLMs generate plausible but visually ungrounded outputs, developing detection methods, evaluation benchmarks, and mitigation techniques to ensure outputs are anchored in visual evidence.
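The verifiable rewards mentioned for the RL sub-topic (IoU scores and format checks) can be sketched as a single scalar reward. This is a minimal illustration under stated assumptions: the `<answer>[...]</answer>` output format, the 0.1 format weight, and all function names are hypothetical and are not drawn from any of the cited papers.

```python
import re

# Hypothetical output format: the model must wrap its box in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*\[([^\]]+)\]\s*</answer>")


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(response, target_box, format_weight=0.1):
    """Return 0 for unparseable output, else a format bonus plus weighted IoU."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0  # format check failed
    try:
        coords = [float(v) for v in match.group(1).split(",")]
    except ValueError:
        return 0.0
    if len(coords) != 4:
        return 0.0
    return format_weight + (1.0 - format_weight) * box_iou(coords, target_box)
```

Because both terms can be checked programmatically against the annotation, no learned reward model is needed, which is the sense in which such rewards are called verifiable.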
Key Insights
- RL with geometric rewards enables small 3-7B models to outperform 72B supervised models on grounding
- Grounded chain-of-thought anchoring reasoning in bounding boxes dramatically reduces visual hallucination
- Zero-shot 3D grounding via multi-view 2D reasoning rivals supervised methods trained on 3D data
- Coordinate-free attention grounding eliminates brittle text-to-number generation for GUI agents
- Contrastive backbones like SigLIP are now the standard foundation for all grounding tasks
Timeline
Research has evolved from improving vision-language alignment quality (2023-2024) through data-centric training recipes (2024-2025) to RL-driven active visual reasoning (2025-2026), with increasing emphasis on grounded chain-of-thought that anchors every reasoning step in spatial evidence.
- PaLI-3 (PaLI-3, 2023) demonstrated that SigLIP contrastive backbones massively outperform classification-pretrained ViTs for localization, achieving +23.7% on RefCOCOgs
- CogAgent (2023) pioneered dual-resolution visual processing for GUI understanding, enabling high-resolution screen grounding with >50% fewer FLOPs
- PaLI-X (On Scaling up a Multilingual..., 2024) jointly scaled vision and language to achieve 86.0 on VQAv2 with integrated OCR pretraining
- 3D-GRAND (3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM Dataset, 2024) introduced million-scale densely grounded 3D data, outperforming prior 3D-LLMs by +7.7% on ScanRefer
Shift from classification-pretrained vision backbones to contrastively pretrained encoders (SigLIP/CLIP) as the standard for grounding tasks.
- FLAIR (2024) introduced text-conditioned attention pooling, outperforming CLIP by +14.4% mIoU on zero-shot segmentation with 100x less data
- SimVG (A Simple Framework for Visual..., 2024) achieved 94.46% accuracy on RefCOCO testA via dynamic weight-balance distillation, training in 12 hours on a single GPU
- Visual-RFT (Visual Reinforcement Fine-Tuning, 2025) launched the RL-for-vision paradigm by extending DeepSeek-R1 style training to visual tasks, improving mAP from 9.8 to 31.3 on COCO
- MM1.5 (Methods, Analysis & Insights from..., 2024) established the data-centric recipe for balancing OCR, grounding, and general capabilities across training stages
- Perception-R1 (Perception-R1, 2025) became the first pure MLLM to surpass 30% mAP on COCO using bipartite matching rewards
- ViGoRL (Grounded Reinforcement Learning for Visual Reasoning, 2025) combined MCTS-guided training with active visual search, achieving 86.4% on V*Bench and outperforming proprietary models
- GUI-Actor (Coordinate-Free Visual Grounding for GUI Agents, 2025) eliminated coordinate generation entirely using attention-based grounding, with its 7B model outperforming the 72B UI-TARS
- Molmo2 (Open Weights and Data for..., 2026) provided the first fully open video grounding pipeline with tracking and pointing, outperforming Gemini 2.5 Pro on ReasonVOS
Fundamental shift from supervised coordinate regression to RL-based policy optimization with verifiable visual rewards, enabling models to actively search and reason over images.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Visual Reinforcement Fine-Tuning | Replace supervised imitation with Group Relative Policy Optimization (GRPO) using geometric rewards (IoU, mAP) to directly optimize visual perception and grounding. | Improves on SFT baselines by +21.5 mAP on COCO open-vocabulary detection (Visual-RFT), achieving 31.3 mAP. Perception-R1 surpasses 30% mAP threshold on COCO, the first pure MLLM to do so. | Visual-RFT (2025), Perception-R1 (2025), VLM-R1 (2025), Perception-Aware (2025) |
| Grounded Visual Chain-of-Thought Reasoning | Redefine each reasoning step as a tuple of text thought plus spatial coordinate, forcing the model to 'point and look' at evidence while thinking. | ViGoRL improves on vanilla GRPO by +12.9% accuracy on SAT-2 spatial reasoning benchmark, achieving 86.4% on V*Bench. Argus achieves 62.7 on MMVP, surpassing Gemini 1.5 Pro (61.3). | Grounded Reinforcement Learning for Visual... (2025), Argus (2025), GRIT (2025), VoCoT (2024) |
| Coordinate-Free & Attention-Based GUI Grounding | Use attention heads or patch-level scoring to directly map instructions to visual regions, bypassing the text-to-coordinate generation bottleneck. | GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro, outperforming the much larger UI-TARS-72B (38.1). SE-RFT-7B achieves 47.3% on ScreenSpot-Pro, surpassing UI-TARS-72B by 24.2%. | CogAgent (2023), GUI-Actor (2025), Enhancing Visual Grounding for GUI... (2025), UI-AGILE (2025) |
| Zero-Shot 3D Visual Grounding via Multi-View VLM Reasoning | Reconceptualize 3D understanding as iterative 2D viewpoint selection and multi-view ensemble projection, bypassing scarce 3D-language datasets. | VLM-Grounder achieves 51.6% Acc@0.25 on ScanRefer, outperforming ZS3DVG by +15.2 points. SeqVLM achieves 55.6% Acc@0.25 on ScanRefer, surpassing previous zero-shot SOTA by +4.0%. | VLM-Grounder (2024), GPT4Scene (2025), Agent3D-Zero (2024), 3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM... (2024) |
| Fine-Grained Vision-Language Alignment for Open-Vocabulary Detection | Condition image representations on specific text queries via attention pooling or pixel-level contrastive learning to capture local visual details. | PaLI-3 (5B) surpasses PaLI-X (55B) on 8 text understanding tasks. FLAIR outperforms CLIP by +14.4% mIoU on zero-shot segmentation despite using 100x less data. MM-Grounding-DINO improves original Grounding-DINO by +12.6 AP on LVIS. | PaLI-3 (2023), FLAIR (2024), An Open and Comprehensive Pipeline... (2024), On Scaling up a Multilingual... (2024) |
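
Group Relative Policy Optimization, named in the first row of the table, scores each sampled response relative to the other responses drawn for the same prompt rather than against a learned critic. A minimal sketch of that normalization step follows, assuming rewards are standardized with the group's mean and population standard deviation; the `eps` term and the function name are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one grounding prompt, scored by IoU (made-up values).
advantages = group_relative_advantages([0.9, 0.5, 0.1, 0.5])
```

Responses above the group mean receive positive advantages and are reinforced; when all responses score the same, the advantages vanish and the prompt contributes no gradient signal.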
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| RefCOCO (testA) | Accuracy (IoU@0.5) | 94.46% | SimVG (2024) |
| ScreenSpot-Pro | Accuracy (%) | 47.3% | Enhancing Visual Grounding for GUI... (2025) |
| COCO Object Detection (mAP) | mAP (mean Average Precision) | 31.9% mAP | Perception-R1 (2025) |
| ScanRefer (Zero-Shot, Acc@0.25) | Accuracy@0.25 IoU | 55.6% | SeqVLM (2025) |
| V*Bench | Accuracy (%) | 86.4% | Grounded Reinforcement Learning for Visual... (2025) |
Known Limitations (4)
- Scale-driven bias in RL training causes models to ignore small but critical objects, as large visual regions dominate reward signals during optimization. (affects: Visual Reinforcement Fine-Tuning (Visual-RFT / GRPO for Vision), Grounded Visual Chain-of-Thought Reasoning)
Potential fix: Scale Relative Policy Optimization (SRPO) normalizes rewards within size bins so small regions compete fairly, as demonstrated by Ground-R1 with +11.9% improvement on V* benchmark.
- Extended reasoning chains degrade visual grounding: longer thinking causes models to drift from image evidence and amplify hallucinations, a phenomenon termed 'more thinking, less seeing'. (affects: Grounded Visual Chain-of-Thought Reasoning, Visual Reinforcement Fine-Tuning (Visual-RFT / GRPO for Vision))
Potential fix: PEARL introduces a Fidelity Gate that halts reasoning policy updates when perception checks fail, and PeRL-VL decouples perception training from reasoning to prevent visual signal degradation.
- GUI and high-resolution grounding methods are highly sensitive to image noise and visual perturbations, with Visual CoT methods showing higher fragility than standard VLMs in 70 out of 96 corrupted settings. (affects: Coordinate-Free & Attention-Based GUI Grounding, Grounded Visual Chain-of-Thought Reasoning)
Potential fix: Injecting high-confidence detection cues from external object detectors (like Grounding DINO) stabilizes intermediate visual steps and mitigates fragility of internal localization.
- 3D visual grounding still relies heavily on multi-view rendering or point clouds, creating computational bottlenecks and losing fine-grained details during 2D-to-3D projection. (affects: Zero-Shot 3D Visual Grounding via Multi-View VLM Reasoning)
Potential fix: Explicit 3D representations as reasoning interfaces (SpatialReasoner) that predict calibrated 3D vectors as intermediate steps, and video-based approaches (3D-RFT) that bypass point cloud processing.
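The idea of normalizing rewards within size bins, attributed to SRPO above, might look roughly like the sketch below. It is not the Ground-R1 implementation: the bin edges, the use of target area as a fraction of the image, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def size_bin(area_fraction, edges=(0.01, 0.1)):
    """Map a target's area fraction to 0 (small), 1 (medium), or 2 (large).

    The edges are hypothetical thresholds, not values from any paper.
    """
    return sum(area_fraction >= edge for edge in edges)


def bin_normalized_rewards(samples, eps=1e-6):
    """Standardize each reward against other samples in the same size bin.

    samples: list of (area_fraction, reward) pairs.
    """
    by_bin = defaultdict(list)
    for area, reward in samples:
        by_bin[size_bin(area)].append(reward)
    stats = {b: (mean(rs), pstdev(rs)) for b, rs in by_bin.items()}
    normalized = []
    for area, reward in samples:
        mu, sigma = stats[size_bin(area)]
        normalized.append((reward - mu) / (sigma + eps))
    return normalized


# Two small targets with low raw rewards and two large targets with high ones.
samples = [(0.005, 0.2), (0.004, 0.4), (0.5, 0.8), (0.6, 0.9)]
scores = bin_normalized_rewards(samples)
```

In this toy batch the better small-object sample ends up with roughly the same normalized score as the better large-object sample, even though its raw reward is less than half as large.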
Major papers in this topic (10)
- Grounded Reinforcement Learning for Visual Reasoning (2025-05) 9
- CogAgent: A Visual Language Model for GUI Agents (2023-12) 9
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding (2026-01) 9
- On Scaling up a Multilingual Vision and Language Model (PaLI-X) (2024-07) 9
- Visual-RFT: Visual Reinforcement Fine-Tuning (2025-03) 8
- Perception-R1: Pioneering Perception Policy with Reinforcement Learning (2025-04) 8
- 3D-GRAND: A Million-Scale Densely-Grounded 3D-LLM Dataset (2024-06) 9
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2025-06) 8
- FLAIR: VLM with Fine-grained Language-informed Image Representations (2024-12) 8
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations (2024-06) 9
Within the same paradigm, another important research direction focuses on Image Captioning.
Image Captioning
What: Image captioning generates natural language descriptions of visual content, bridging perception and language understanding in vision-language models.
Why: High-quality captions enable downstream tasks like visual reasoning, text-to-image generation, and accessible image understanding for diverse applications.
Baseline: Standard supervised fine-tuning trains models on human-annotated image-caption pairs using cross-entropy loss over ground-truth sequences.
- Generating detailed, factually accurate descriptions without hallucinating objects or attributes not present in the image
- Evaluating caption quality reliably when traditional metrics poorly correlate with human judgment for long, detailed descriptions
- Scaling captioning across diverse visual domains including charts, documents, remote sensing, and specialized imagery
Running Example
Baseline: A standard SFT model might produce 'A busy outdoor market with people shopping', which is generic and misses key details such as the musician, the specific produce types, and the child with the balloon. It may also hallucinate objects that are not present, such as a dog near the stalls.
Challenge: This example illustrates three key challenges: (1) the model must describe many fine-grained details without hallucinating (e.g., inventing a dog not in the scene), (2) it must maintain narrative coherence across a complex scene with multiple focal areas, and (3) traditional metrics like BLEU cannot distinguish between a vague correct caption and a richly detailed accurate one.
Overall Progress
Image captioning has undergone two major paradigm shifts: first from traditional n-gram metrics to human-aligned evaluation frameworks based on atomic fact decomposition and pairwise comparison, and second from supervised fine-tuning to reinforcement learning with verifiable rewards. Modern systems achieve human-competitive detailed captioning (GPT-4o surpasses human baselines on CapArena) while simultaneously reducing hallucinations through inference-time search and multi-agent verification. The field has also expanded from natural-image-only captioning to unified multi-domain systems covering documents, charts, 3D scenes, and multilingual content.
Sub-topics
Dense and Detailed Image Captioning
8 papers
Methods for generating comprehensive, fine-grained image descriptions that capture objects, attributes, spatial relations, and contextual details beyond simple one-sentence captions. This sub-topic focuses on training paradigms (especially RL) and inference strategies that improve caption richness and accuracy.
Captioning Evaluation and Benchmarks
5 papers
New metrics, benchmarks, and evaluation frameworks designed to accurately measure the quality, factuality, and comprehensiveness of detailed image captions generated by modern VLMs, moving beyond legacy n-gram-based metrics.
Domain-Specific and Multimodal Captioning
10 papers
Captioning systems tailored for specialized visual domains including remote sensing imagery, scientific figures, geometric diagrams, text-rich documents, manga narratives, and 3D scenes, addressing unique challenges each domain presents.
Knowledge-Augmented and Personalized Captioning
6 papers
Approaches that integrate external knowledge sources, retrieval-augmented generation, or user-specific concept databases to generate more informative and personalized captions that go beyond generic visual descriptions.
Captioning for Downstream Applications
4 papers
Using image captioning as an intermediary text representation to enable tasks like video anomaly detection, safety filtering, physics question answering, and synthetic training data generation.
Robustness, Bias, and Training Paradigms
7 papers
Research addressing VLM robustness to adversarial attacks, social bias propagation, missing modalities, and novel training strategies including in-context learning, encoder-free architectures, and cross-tokenizer prompt optimization.
Key Insights
- RL with verifiable rewards outperforms supervised fine-tuning for dense captioning quality and diversity
- Atomic fact decomposition enables reliable hallucination detection where traditional metrics fail
- Inference-time search with value networks achieves 74% human preference over greedy decoding
- Converting images to rich text enables text-only LLMs to match multimodal visual reasoning performance
- Single-image retrieval-based personalization matches or exceeds multi-image fine-tuning approaches
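The inference-time search insight can be illustrated with a small step-level search loop: at each step several candidate sentences are proposed, and a value function picks the one that leads to the best-scoring caption so far. This is a simplified sketch, not the cited method; `propose` and `value` stand in for a VLM sampler and a learned value network, and the toy versions below are hypothetical.

```python
def value_guided_decode(propose, value, num_steps, num_candidates):
    """Build a caption sentence by sentence, keeping the highest-value candidate."""
    caption = []
    for _ in range(num_steps):
        candidates = propose(caption, num_candidates)
        if not candidates:
            break
        best = max(candidates, key=lambda sentence: value(caption + [sentence]))
        caption.append(best)
    return caption


# Toy stand-ins: objects assumed visible in the market scene from the running example.
VISIBLE = {"market", "musician", "balloon"}


def toy_propose(prefix, k):
    pool = [
        ["A busy outdoor market.", "A dog runs past a market."],
        ["A musician plays near the stalls.", "People are shopping."],
    ]
    step = len(prefix)
    return pool[step][:k] if step < len(pool) else []


def toy_value(sentences):
    text = " ".join(sentences).lower()
    grounded = sum(word in text for word in VISIBLE)
    hallucinated = text.count("dog")  # the dog is not in the scene
    return grounded - 2 * hallucinated


result = value_guided_decode(toy_propose, toy_value, num_steps=3, num_candidates=2)
```

Greedy decoding would commit to whatever the sampler ranks first; the search instead spends extra proposals per step to avoid the hallucinated candidate.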
Timeline
Research has evolved from improving resolution handling and basic caption quality (2023) through evaluation innovation and personalization (2024) to RL-driven dense captioning and unified multi-domain systems (2025-2026), with increasing emphasis on factual accuracy over fluency and on converting visual perception into text to enable purely linguistic reasoning.
- A test-time adaptation approach (Test-Time, 2023) pioneered using CLIP similarity as a reward signal for test-time reinforcement learning adaptation of VLMs, outperforming TPT by 5.4% on average
- Monkey (Image Resolution and Text Label..., 2023) introduced sliding-window patch processing enabling high-resolution captioning up to 1344×896 with per-patch LoRA adapters
- LL3DA (Visual Interactive Instruction Tuning for Omni-3D, 2023) enabled 3D scene captioning by integrating visual prompts (clicks, bounding boxes) with point cloud processing via a multi-modal transformer
- CompreCap (Comprehensive Image Captioning Benchmark, 2024) introduced directed scene graph evaluation for detailed captions with hierarchical object-attribute-relation matching
- ALOHa (A New Measure for Hallucination, 2024) replaced fixed-vocabulary CHAIR with open-vocabulary LLM-based hallucination detection, improving +30.8% on out-of-domain objects
- RAP (Retrieval-Augmented Personalization, 2024) introduced a remember-retrieve-generate paradigm for personalized captioning with single-image concept learning, achieving 84.1 CIDEr
- VisVM (Vision Value Model, 2024) demonstrated inference-time search with learned value networks, achieving 74% human preference over greedy decoding and +10.8% average improvement across 9 benchmarks
- CapMAS (Caption Factuality Multi-Agent System, 2024) introduced decomposition-verification-revision for hallucination correction, identifying that MLLMs rely more on language priors as captions grow longer
The field shifted from optimizing traditional metrics (BLEU, CIDEr) to developing sophisticated evaluation frameworks based on atomic fact decomposition, directed scene graphs, and human preference modeling.
- CapArena (Benchmarking Detailed Image Captioning, 2025) established Elo-based model ranking for detailed captioning, with its automated evaluator reaching 94.3% correlation with human judgment, and showed that GPT-4o surpasses human baselines
- Painting with Words (2025) reduced hallucinations by 40.5% using atomic decomposition-based RL rewards, and introduced the DCScore metric, which achieves 0.90 Spearman correlation with VLM Arena
- OmniCaptioner (One Captioner to Rule Them All, 2025) unified captioning across natural images, visual text, and structured visuals with a 21M dataset, enabling text-only LLMs to achieve state-of-the-art visual reasoning
- CapRL (Dense Image Caption via RL, 2025) achieved caption quality comparable to Qwen2.5-VL-72B using a 7B model with perception-reasoning decoupled rewards
- TDSR (Top-Down Semantic Refinement, 2025) reframed captioning as hierarchical planning with MCTS and a lightweight value network, reducing expensive VLM calls by an order of magnitude
Reinforcement learning with verifiable rewards emerged as the dominant training paradigm for dense captioning, replacing supervised fine-tuning and achieving quality comparable to models 10x larger.
- RubiCap (Rubric-Guided Reinforcement Learning, 2026) replaced scalar rewards with committee-generated rubrics of binary checkable rules, enabling a 7B model to outperform GPT-4V and Qwen2.5-VL-72B in blind ranking
- MUNIChus (Multilingual News Image Captioning, 2026) introduced the first multilingual news captioning benchmark covering 9 languages with 700K+ images, showing fine-tuned models more than double prompting-based performance
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Reinforcement Learning for Dense Captioning | Caption quality is measured by whether a text-only LLM can answer visual questions using only the generated caption, or via rubric-based binary checks, providing scalable reward signals. | CapRL improves on DenseFusion-1M by +6.8% accuracy on InfoVQA and +3.6% on ChartVQA; RubiCap-7B achieves +20.8% win-rate over base model on PixMoCap, outperforming GPT-4V and Qwen2.5-VL-72B in blind ranking. | Test-Time (2023), Painting with Words (2025), CapRL (2025), RubiCap (2026) |
| Inference-Time Search and Hierarchical Refinement | A trained value network predicts long-term caption quality using visual grounding signals, guiding tree search to explore multiple descriptive paths before committing to final output. | VisVM-guided captions are preferred 74% over greedy decoding in human evaluation, with +10.8% average improvement across 9 benchmarks for LLaVA-Next-7B; TDSR reduces VLM calls by an order of magnitude versus standard search. | Scaling Inference-Time Search with Vision... (2024), Top-Down (2025) |
| Unified Multi-Domain Captioning | Converting diverse visual inputs into rich textual descriptions via unified pipelines enables text-only LLMs to achieve visual reasoning without visual encoder training. | OmniCaptioner + DeepSeek-R1 achieves 40.5% on MathVerse, outperforming Qwen2-VL-7B (31.9%); Monkey improves +9.77% over Qwen-VL on document VQA; LaRA achieves +202 points on OCRBench over LLaVAR. | Monkey (2023), TRINS (2024), One Captioner to Rule Them... (2025), Enhancing Large Vision-Language Models with... (2025) |
| Knowledge-Augmented and Personalized Captioning | Retrieving entity-specific information from external databases and grounding it to detected visual regions produces contextually rich captions that go beyond pattern-matching descriptions. | RAP achieves 84.1 CIDEr on personalized captioning, outperforming MyVLM (76.8) and Yo'LLaVA (73.5); MsRAG outperforms standard mRAG by +21.9% CIDEr using GPT-4o on knowledge-intensive captioning. | MyVLM (2024), RAP (2024), MsRAG (2025) |
| Hallucination-Aware Evaluation Frameworks | Decomposing captions into atomic facts or objects enables fine-grained per-claim verification against the image, replacing holistic similarity scores that mask hallucinations. | CapArena-Auto achieves 94.3% correlation with human rankings, far surpassing traditional METEOR; ALOHa improves hallucination detection by +30.8% over CHAIR on out-of-domain objects (nocaps-FOIL). | CompreCap (2024), ALOHa (2024), Multimodal large language models excel... (2024), CapArena (2025) |
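
The first row of the table describes rewarding a caption by whether a text-only LLM can answer visual questions from the caption alone. A minimal sketch of that reward is given below, with the answering model replaced by a hypothetical keyword-matching stub (`toy_answer`); the question format and all names are assumptions for illustration.

```python
def caption_qa_reward(caption, qa_items, answer_fn):
    """Fraction of questions answered correctly using only the caption text.

    qa_items: list of (question, options, correct_answer) tuples.
    answer_fn: stand-in for a text-only LLM; it never sees the image.
    """
    if not qa_items:
        return 0.0
    correct = 0
    for question, options, gold in qa_items:
        predicted = answer_fn(caption, question, options)
        correct += predicted.strip().lower() == gold.strip().lower()
    return correct / len(qa_items)


def toy_answer(caption, question, options):
    """Hypothetical stub: pick the first option mentioned in the caption."""
    text = caption.lower()
    for option in options:
        if option.lower() in text:
            return option
    return "unknown"


qa_items = [
    ("Who is performing near the stalls?", ["musician", "juggler"], "musician"),
    ("What is the child holding?", ["balloon", "kite"], "balloon"),
]
vague = "A busy outdoor market with people shopping"
detailed = "A busy outdoor market where a musician plays and a child holds a balloon"
vague_reward = caption_qa_reward(vague, qa_items, toy_answer)
detailed_reward = caption_qa_reward(detailed, qa_items, toy_answer)
```

The vague caption is not wrong, but it carries no answerable detail, so it earns a lower reward than the detailed one. This is the property n-gram metrics struggle to capture.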
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| CapArena (Detailed Captioning Elo Rating) | Elo Rating (higher is better) | ~1195 Elo | CapArena (2025) |
| PixMoCap (Dense Captioning Win-Rate) | Win-Rate improvement (% preferred over baseline) | +20.8% win-rate improvement over base model | RubiCap (2026) |
| mmHal-V (Hallucination Benchmark) | Relative Hallucination Reduction (%) | 40.5% relative hallucination reduction | Painting with Words (2025) |
| Personalized Image Captioning CIDEr | CIDEr (higher is better) | 84.1 CIDEr | RAP (2024) |
| nocaps-FOIL (Out-of-Domain Hallucination Detection) | Improvement over CHAIR metric (%) | +30.8% improvement over CHAIR | ALOHa (2024) |
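
For readers unfamiliar with Elo ratings such as the one reported for CapArena, the standard Elo update after one pairwise comparison is sketched below. CapArena's exact rating procedure is not described in this summary, so the K-factor of 32 and the starting ratings are assumptions.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's caption was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two captioners start level; A's caption is preferred in one comparison.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Ratings are zero-sum per comparison, so a model's final Elo reflects how often its captions are preferred and how strong its opponents were.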
Known Limitations (4)
- Hallucination in detailed captions: As models generate longer, more detailed descriptions, they increasingly rely on language priors rather than visual input, causing factual errors that compound over sequence length. (affects: Reinforcement Learning for Dense Captioning, Unified Multi-Domain Captioning)
Potential fix: Multi-agent verification (CapMAS), atomic fact decomposition with per-fact visual grounding (FeedQuill), inference-time search guided by visual similarity (VisVM), and rubric-based RL training (RubiCap).
- Evaluation metric limitations: Traditional metrics like BLEU, CIDEr, and METEOR correlate poorly with human judgment for detailed captions, while newer LLM-based metrics are expensive and may introduce their own biases. (affects: Hallucination-Aware Evaluation Frameworks)
Potential fix: CapArena-Auto uses GPT-4o with reference captions to achieve 94.3% correlation with human rankings at lower cost; DCScore decomposes evaluation into verifiable atomic facts; CompreCap uses directed scene graphs for structured evaluation.
- Domain gap across visual types: Models trained on natural images perform poorly on charts, documents, scientific figures, remote sensing imagery, and manga due to fundamentally different visual-linguistic patterns in each domain. (affects: Unified Multi-Domain Captioning, Knowledge-Augmented and Personalized Captioning)
Potential fix: Domain-specific mixture of experts (RS-MoE achieving 13B-level performance with 1B parameters), massive multi-domain training datasets (OmniCaptioner's 21M samples), and OCR-augmented architectures (LaRA) that explicitly feed text content to the LLM.
- Vulnerability to adversarial perturbations and bias propagation: VLMs can be manipulated by imperceptible frequency-domain perturbations and exhibit systematic social biases that propagate from embeddings to downstream captioning and retrieval outputs. (affects: Unified Multi-Domain Captioning, Inference-Time Search and Hierarchical Refinement)
Potential fix: Frequency-domain robustness training, bias-aware calibration methods, and multi-model ensemble verification; larger models exhibit stronger bias propagation (Spearman ρ=0.88 for CLIP-L-14 vs 0.80 for CLIP-B-32), suggesting model scaling alone will not resolve this.
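Atomic fact decomposition, listed among the fixes above, reduces hallucination measurement to counting unsupported claims. The sketch below assumes the caption has already been split into atomic facts (in practice an LLM would do this) and uses a hypothetical keyword check in place of a real visual verifier; the numbers passed to `relative_reduction` are made up purely to show the formula.

```python
def hallucination_rate(facts, is_supported):
    """Fraction of atomic facts that the verifier cannot support."""
    if not facts:
        return 0.0
    unsupported = sum(1 for fact in facts if not is_supported(fact))
    return unsupported / len(facts)


def relative_reduction(rate_before, rate_after):
    """Relative drop in hallucination rate, e.g. 0.5 means the rate halved."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before


# Hypothetical verifier: a fact is supported if it mentions a visible object.
visible_objects = {"market", "musician", "balloon"}


def toy_verifier(fact):
    return any(obj in fact.lower() for obj in visible_objects)


facts = [
    "There is an outdoor market.",
    "A musician is playing.",
    "A dog sits near the stalls.",
]
rate = hallucination_rate(facts, toy_verifier)
halved = relative_reduction(0.20, 0.10)  # illustrative values only
```

Scoring per claim makes the hallucinated dog visible as one failed fact out of three, whereas a holistic similarity score could still rate the caption highly.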
Major papers in this topic (10)
- CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era (2025-03) 9
- RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning (2026-03) 8
- CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning (2025-09) 8
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension (2024-12) 8
- One Captioner to Rule Them All (2025-04) 8
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning (2025-03) 8
- RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models (2024-10) 8
- Multimodal large language models excel at generating highly detailed captions but often produce hallucinations (2024-12) 8
- Top-Down Semantic Refinement for Image Captioning (2025-10) 8
- RS-MoE: A Vision-Language Model with Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering (2024-11) 8
Within the same paradigm, another important research direction focuses on Document and Chart Understanding.
Document and Chart Understanding
What: Research on enabling vision-language models to accurately read, parse, and reason over documents, charts, tables, and other visually structured content.
Why: Trillions of pages of knowledge are locked in PDFs, charts, and scanned documents that current AI systems cannot reliably extract or reason about.
Baseline: Traditional pipelines chain separate OCR, layout analysis, and text-based language models, losing visual context and propagating errors across stages.
- Balancing fine-grained text recognition with global layout understanding across high-resolution, multi-page documents
- Performing multi-step numerical and logical reasoning over chart data requiring precise visual grounding
- Scaling to production deployment with compact models while handling diverse languages, scripts, and real-world distortions
Running Example
Baseline: A traditional OCR pipeline extracts text from all 50 pages but misses the bar chart's projected values entirely, retrieves irrelevant pages about other revenue segments, and cannot cross-reference the textual revenue figure with the visual forecast.
Challenge: This example requires three capabilities current systems lack: (1) retrieving the correct page from a long document using visual cues, (2) extracting precise numerical values from a bar chart, and (3) performing comparative reasoning across text and chart modalities.
Overall Progress
Document and chart understanding has undergone two major paradigm shifts: first, from text-based to visual-centric retrieval (2024), and second, from supervised fine-tuning to reinforcement learning with verifiable rewards (2025). The field has also demonstrated that compact sub-1B models with specialized architectures can match or exceed general-purpose models 100x their size, suggesting that task-specific design outweighs brute-force scaling for structured document tasks.
Sub-topics
End-to-End OCR & Document Parsing
15 papers
Research on unified models that convert document images directly into structured text (Markdown, HTML, JSON) without brittle multi-stage pipelines, increasingly using reinforcement learning for optimization.
Chart Reasoning & Understanding
10 papers
Methods for extracting data from and performing complex multi-step reasoning over charts, graphs, and flowcharts, including numerical comprehension and cross-subchart inference.
Multi-Page Document Understanding & RAG
11 papers
Systems that retrieve and reason across multiple pages or documents using visual embeddings, dynamic retrieval strategies, and multi-agent architectures to answer complex questions.
Document-Centric VLM Architectures
7 papers
Large-scale vision-language models specifically designed or adapted for document understanding through joint scaling of vision encoders and language decoders with document-specific training objectives.
Benchmarks & Evaluation
14 papers
New evaluation frameworks and datasets that expose limitations of current models on real-world documents, including multilingual charts, ancient scripts, enterprise content, physical distortions, and agentic document navigation.
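The retrieval step in the multi-page RAG sub-topic can be sketched as nearest-neighbor search over page-image embeddings. This is a simplified single-vector illustration: the embeddings are made-up three-dimensional vectors standing in for the output of a visual encoder, and real systems may use multi-vector or late-interaction scoring instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_pages(query_vec, page_vecs, top_k=2):
    """Return indices of the top_k pages most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:top_k]]


# Hypothetical embeddings: each page is embedded as an image, so charts and
# layout contribute to the vector rather than being lost in text extraction.
page_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.0]
top_pages = retrieve_pages(query_vec, page_vecs, top_k=2)
```

Only the retrieved pages are then passed to the VLM for answering, which keeps the context small even for long documents.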
Key Insights
- Reinforcement learning with verifiable rewards outperforms supervised fine-tuning for document OCR and chart reasoning.
- Sub-1B parameter models match 100x-larger VLMs on document parsing with specialized architecture design.
- Visual-centric retrieval outperforms text-based retrieval by 20%+ on layout-rich documents.
- Training on few complex reasoning examples transfers better than thousands of simple extraction tasks.
- State-of-the-art models still score below 60% accuracy on real-world document benchmarks.
Timeline
Research has evolved from scaling general VLMs with document data (2024) to specialized compact architectures trained with reinforcement learning (2025-2026), while benchmarks have shifted from academic multiple-choice formats to agentic, real-world evaluations testing multi-hop reasoning across modalities.
- SPHINX-X (2024) simplified MLLM training into a single-stage paradigm with learnable skip tokens, covering 1.1B to MoE scales
- ChartPaLI-5B (Chart-based Reasoning, 2024) pioneered transferring reasoning from LLMs to a 5B VLM, outperforming GPT-4V on ChartQA
- PaLI-X (On Scaling up a Multilingual..., 2024) jointly scaled vision (ViT-22B) and language (32B) components to new SOTA on VQAv2 (86.0) and TextVQA (84.5)
- Idefics3 (Building and better understanding vision-language models, 2024) released Docmatix, a 240x larger open document dataset, with +13.7 point DocVQA improvement
- M3DocRAG (2024) introduced visual-centric RAG, encoding pages as images and reducing retrieval latency from 20s to under 2s per query
- SV-RAG (2024) reused the MLLM's own hidden states for visual retrieval, eliminating the need for separate encoders
- olmOCR (olmOCR: Unlocking Trillions of Tokens..., 2025) enabled PDF processing at $176 per million pages via document-anchored distillation, 35x cheaper than GPT-4o
- MME-RealWorld (2025) exposed massive gaps between academic and real-world performance, with GPT-4o failing to reach 60% accuracy
Shift from text-based to visual-centric document retrieval: treating pages as images for retrieval rather than extracting text first, preserving charts and layout information.
- Chart-R1 (Chart-R1, 2025) combined code-based data synthesis with CoT-RL, surpassing GPT-4o on ChartQA with 83.9%
- olmOCR 2 (olmOCR 2: Unit Test Rewards..., 2025) introduced unit-test-based RL rewards, achieving +14.2 point OCR improvement over the initial release
- TRivia (2025) demonstrated self-supervised table recognition that surpasses Gemini 2.5 Pro and GPT-5 without labeled data
- Chain-of-Evidence (2025) introduced RL-based evidence grounding with bounding box attribution, improving localization IoU by 47.0%
Reinforcement learning with verifiable rewards (RLVR) replaces supervised fine-tuning as the dominant training paradigm for document tasks, enabling optimization for functional correctness without expensive labels.
- GLM-OCR (2026) ranked first on OmniDocBench v1.5 with a 0.9B model using multi-token prediction for 50% throughput gain
- PaddleOCR-VL-1.5 (PaddleOCR-VL-1.5, 2026) achieved 94.5% accuracy with mask-based segmentation for warped documents, outperforming 235B-parameter VLMs
- MADQA (Strategic Navigation or Stochastic Search?, 2026) introduced agentic document QA, revealing that humans achieve 50% accuracy on first query while Gemini 3 Pro starts at ~12%
- VisDoT (2026) formalized graphical perception theory for chart grounding, achieving +33.2% on VisDoTQA and surpassing GPT-4o on ChartQAPro
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Reinforcement Learning for Document Intelligence | Uses verifiable, rule-based rewards (unit tests, numerical accuracy) instead of human labels to train document understanding models via RL. | Improves on supervised fine-tuning (SFT) by +14.2 points on olmOCR-Bench (olmOCR 2) and +16.7% relative on MultiChartQA (Chart-RL); TRivia surpasses Gemini 2.5 Pro on CC-OCR benchmark, achieving 84.15 vs 79.46 TEDS. | olmOCR 2: Unit Test Rewards... (2025), Chart-RL (2026), TRivia (2025), LightOnOCR (2026) |
| Visual Document Retrieval-Augmented Generation | Encodes document pages as visual embeddings using models like ColPali, enabling retrieval that preserves charts, tables, and layout information lost by OCR. | Improves on text-based RAG baselines by +22.5% Recall@1 on page retrieval (MMDocIR); SimpleDoc achieves 60.58% on MMLongBench, outperforming M3DocRAG (41.8%) and MDocAgent (55.3%) with +10.4% on LongDocURL. | M3DocRAG (2024), Chain-of-Evidence (2025), MURE (2026), SimpleDoc (2025) |
| Compact End-to-End Document Parsers | Separates layout analysis (detection) from content recognition (VLM decoding) in sub-1B models with multi-token prediction for speed. | GLM-OCR scores 94.6 on OmniDocBench v1.5, ranking first among all models, and 93.7 on Nanonets-KIE, ahead of GPT-5.2 (87.5); DocVLM improves DocVQA by +30.6 points (56.0% to 86.6%) under a strict 256 visual token limit. | GLM-OCR (2026), PaddleOCR-VL-1.5 (2026), olmOCR: Unlocking Trillions of Tokens... (2025), DocVLM (2024) |
| Chart Reasoning via Grounding and Transfer | Synthesizes reasoning traces from LLMs and decomposes chart questions into perceptual grounding and logical inference sub-tasks. | Chart-R1 achieves 83.9% on ChartQA, surpassing GPT-4o (80.3%) and Claude-3.5-Sonnet (82.1%); VisDoT improves ChartQA by +11.2% via human-like interpretation grounding and surpasses GPT-4o on ChartQAPro. | Chart-based Reasoning (2024), Chart-R1 (2025), VisDoT (2026), ReFocus (2025) |
| Scaled Multi-Task VLMs for Documents | Simultaneously scales both vision and language components with multi-stage training recipes including document-specific objectives like text spotting. | PaLI-X achieves 86.0 on VQAv2, surpassing the previous 84.3 SOTA, and 84.5 on TextVQA (+4.6 over prior best 79.9); Idefics3-8B improves DocVQA by +13.7 points over Idefics2-8B using the 240x larger Docmatix dataset. | On Scaling up a Multilingual... (2024), PaliGemma 2 (2024), Building and better understanding vision-language... (2024), SPHINX-X (2024) |
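
The visual document retrieval row above relies on page-level visual embeddings. As a rough illustration, the sketch below shows ColBERT-style late-interaction (MaxSim) scoring, the scheme that ColPali-style retrievers are generally described as using. The embeddings are toy 2-D vectors, and `maxsim_score` and `rank_pages` are hypothetical helper names rather than any library's API.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score: for each query token, keep its best-matching
    page patch, then sum those maxima over the query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from most to least relevant."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings (illustrative only; real systems use learned vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.1, 0.2]],  # page 0: matches only the first query token well
    [[1.0, 0.0], [0.0, 1.0]],  # page 1: matches both query tokens
]
ranking = rank_pages(query, pages)
```

Because each page is scored from its image patches directly, no OCR step sits between the page and the retriever, which is the property the table attributes to visual-centric retrieval.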
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| ChartQA | Accuracy (%) | 83.9% | Chart-R1 (2025) |
| OmniDocBench v1.5 | Overall Score | 94.6 | GLM-OCR (2026) |
| DocVQA | Accuracy (%) | 86.6% | DocVLM (2024) |
| CC-OCR (Table Recognition) | TEDS (Tree Edit Distance Similarity) | 84.15 TEDS | TRivia (2025) |
| olmOCR-Bench | Overall Score (unit test pass rate) | +14.2 points over olmOCR v1 | olmOCR 2: Unit Test Rewards... (2025) |
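
The olmOCR-Bench row reports a unit test pass rate, and the RL methods in this topic use similar rule-based checks as rewards. A minimal sketch of that idea follows, assuming three hypothetical check types (text present, text absent, reading order). The check names and the `pass_rate` helper are illustrative assumptions, not the benchmark's actual test suite.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def text_present(snippet: str) -> Check:
    """Pass if the snippet appears in the parsed output."""
    return lambda output: snippet in output


def text_absent(snippet: str) -> Check:
    """Pass if the snippet (e.g. a running page header) was dropped."""
    return lambda output: snippet not in output


def appears_before(first: str, second: str) -> Check:
    """Pass if both snippets occur and `first` precedes `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def pass_rate(output: str, checks: List[Check]) -> float:
    """Fraction of checks passed; usable as a scalar reward in [0, 1]."""
    if not checks:
        return 0.0
    return sum(1 for c in checks if c(output)) / len(checks)


checks = [
    text_present("Total: 42"),
    text_absent("Page 3 of 10"),
    appears_before("Introduction", "Conclusion"),
]
parsed = "Introduction\nTotal: 42\nConclusion"
reward = pass_rate(parsed, checks)
```

Checks of this kind need no human-written reference transcription, only programmatic assertions about the output, which is what makes the reward verifiable.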
Known Limitations (4)
- Severe multilingual performance degradation: models trained primarily on English data show massive accuracy drops on low-resource languages (e.g., Hindi, Bengali, Odia), limiting global applicability of document understanding systems. (affects: Scaled Multi-Task VLMs for Documents, Chart Reasoning via Grounding and Transfer)
Potential fix: Scalable multilingual data generation pipelines (like PolyChartQA's code decoupling approach) and localized curriculum learning (like VARCO-VISION's bilingual training) show promise for reducing language gaps.
- Physical distortion fragility: most document parsers are optimized for clean, digital-born documents and fail significantly on scanned, warped, skewed, or poorly lit real-world documents encountered in production settings. (affects: Compact End-to-End Document Parsers, Visual Document Retrieval-Augmented Generation)
Potential fix: PaddleOCR-VL-1.5's mask-based instance segmentation and SAVIOR's targeted fine-tuning on failure-inducing patterns address specific distortion types, but a general-purpose solution remains elusive.
- Benchmark-reality gap: current benchmarks use multiple-choice formats and synthetic data that fail to capture the complexity of enterprise deployment, where free-form generation, noisy inputs, and domain-specific schemas are the norm. (affects: Reinforcement Learning for Document Intelligence, Chart Reasoning via Grounding and Transfer)
Potential fix: Frameworks like ViLD (enterprise-focused evaluation) and MADQA (agentic document QA with accuracy-effort trade-off metrics) are beginning to bridge this gap by evaluating operational capabilities.
- Cross-modal reasoning weakness: models show strong text modality bias and struggle with questions requiring integration of evidence across text, tables, and charts within the same document, especially for comparative and tabular reasoning. (affects: Visual Document Retrieval-Augmented Generation, Scaled Multi-Task VLMs for Documents)
Potential fix: Chain-of-Evidence's RL-based stepwise attribution and VisDoT's decomposition-of-thought approach show that explicitly grounding reasoning steps in visual regions can mitigate cross-modal failures.
Major papers in this topic (10)
- olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models (2025-02) 9
- Logics-Parsing-Omni Technical Report (2026-03) 9
- Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections (2026-03) 9
- MME-RealWorld: A Benchmark for MLLM in the Real World (2025-02) 9
- On Scaling up a Multilingual Vision and Language Model (2024-07) 9
- TRivia: Train Your Own Proprietary Model with Unlabeled Data (2025-12) 9
- olmOCR 2: Unit Test Rewards for Document OCR (2025-10) 8
- Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner (2025-07) 8
- GLM-OCR Technical Report (2026-03) 8
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding (2024-11) 8
Moving to the next paradigm, we turn to Multimodal Reasoning.
Multimodal Reasoning
What: Research on enabling models to perform complex multi-step reasoning across visual, auditory, and textual modalities, integrating perception with logical inference.
Why: Real-world tasks, from solving math problems with diagrams to navigating GUIs, require jointly understanding multiple modalities and reasoning over them coherently.
Baseline: Standard multimodal models encode image and text inputs through separate encoders, then generate answers in a single forward pass without iterative verification.
- Visual perception errors propagate through reasoning chains, causing cascading failures in downstream steps
- Verbose Chain-of-Thought reasoning increases latency and compute cost without proportional accuracy gains
- Binary reward signals in reinforcement learning provide no gradient for near-correct predictions
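The last point, that binary rewards give no signal for near-correct predictions, can be illustrated with a small sketch. It compares a hit-or-miss grounding reward with a Gaussian-shaped one centered on the target element. The function names and the choice of tying the Gaussian width to the box size are assumptions for illustration, not the formulation of any specific paper.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def binary_reward(click: Tuple[float, float], box: Box) -> float:
    """1.0 if the click lands inside the target box, else 0.0."""
    x, y = click
    x0, y0, x1, y1 = box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0


def gaussian_reward(click: Tuple[float, float], box: Box, scale: float = 0.5) -> float:
    """Smooth reward that peaks at the box center and decays with distance.
    Sigma is `scale` times the box width/height (an assumed choice)."""
    x, y = click
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = scale * (x1 - x0), scale * (y1 - y0)
    z = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
    return math.exp(-0.5 * z)


target = (100.0, 100.0, 140.0, 120.0)
near_miss = (145.0, 110.0)  # just outside the right edge
far_miss = (400.0, 300.0)

rewards = {
    "near": (binary_reward(near_miss, target), gaussian_reward(near_miss, target)),
    "far": (binary_reward(far_miss, target), gaussian_reward(far_miss, target)),
}
```

Under the binary reward both misses score 0.0 and are indistinguishable, while the Gaussian reward ranks the near miss well above the far one, so a policy update can still move predictions toward the target.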
Running Example
Baseline: A standard MLLM reads the image in one pass, misidentifies which angle is 40° or overlooks the bisector line, then chains incorrect geometric relationships to produce a wrong answer with no opportunity for self-correction.
Challenge: This example illustrates three key challenges: (1) accurate visual perception to extract angle labels and line segments from the diagram, (2) multi-step geometric reasoning requiring correct prerequisite knowledge, and (3) the need for step-level verification to catch perception errors before they corrupt the entire reasoning chain.
Overall Progress
Multimodal reasoning has evolved from evaluating basic visual understanding to actively orchestrating multi-step reasoning with tool use and self-verification. Key paradigm shifts include the transition from binary to process-level rewards, the compression of verbose reasoning into latent representations, and the emergence of agentic frameworks that decouple perception from reasoning. With structured training, 8B-parameter models can now match models nearly 10x their size (for example, MPO's 8B model versus 76B models on MathVista).
Sub-topics
Multimodal Mathematical Reasoning
7 papers
Methods and benchmarks for solving mathematical problems that require understanding visual diagrams, charts, or figures alongside textual problem statements, often involving multi-step logical deduction with process-level supervision.
Efficient and Structured Chain-of-Thought
5 papers
Techniques for compressing, pruning, or restructuring Chain-of-Thought reasoning to reduce computational overhead while preserving or improving accuracy, including latent-space reasoning and iterative test-time scaling.
GUI and Agentic Multimodal Reasoning
4 papers
Research on autonomous agents that interact with graphical interfaces or actively invoke external tools (code execution, web search, visual manipulation) during multimodal reasoning loops.
Domain-Specific Multimodal Reasoning
7 papers
Application of multimodal reasoning to specialized domains including image quality assessment, audio understanding, time series forecasting, architectural design, knowledge graphs, and video comprehension.
Multimodal Evaluation and Safety
4 papers
Benchmarks for assessing multimodal reasoning capabilities across diverse dimensions and research on adversarial robustness, including knowledge poisoning attacks and misinformation detection in multimodal settings.
Key Insights
- Visual perception errors, not reasoning failures, cause most multimodal math mistakes
- Process reward models catch flawed reasoning even when final answers appear correct
- Latent reasoning tokens compress Chain-of-Thought to 6% of its original length while maintaining comparable accuracy
- Agentic tool use enables small models to outperform models ten times their size
- Test-time compute scaling transfers effectively from text-only to multimodal domains
Timeline
Research has progressed from static benchmark evaluation (2023-2024) through RL-based reward shaping and process supervision (2024-2025) to agentic, tool-augmented reasoning systems with test-time scaling (2025-2026), with increasing emphasis on compute-efficient inference and multi-agent collaboration.
- MM-BigBench (2023) established the first benchmark for multimodal content comprehension where text and image carry equal semantic weight, going beyond visual-only tasks
- MM-MATH (2024) introduced process evaluation via LMM-as-a-Judge, revealing that diagram misinterpretation causes over 50% of errors in leading models like GPT-4o
- GAMA (2024) extended multimodal reasoning to audio by integrating multi-layer feature aggregation with soft semantic prompting, outperforming baselines by 1-84%
- We-Math (2024) pioneered knowledge-based hierarchical decomposition, exposing that many LMMs exhibit high rote memorization rates while failing prerequisite sub-problems
- MPO (Enhancing the Reasoning Ability of..., 2024) combined DPO, BCO, and SFT into a unified preference optimization framework, achieving 8B-model performance comparable to 76B models on MathVista
- URSA (Unlocking Multimodal Mathematical Reasoning via..., 2025) introduced PS-GRPO with process reward drop-moments and constructed two large-scale datasets (MMathCoT-1M, DualMath-1.1M), outperforming GPT-4o across 6 benchmarks
- Heima (Efficient Reasoning with Hidden Thinking, 2025) demonstrated that reasoning chains can be compressed to 6% of their original length by encoding steps into latent 'thinking tokens' with progressive training
- Q-Insight (2025) adapted GRPO to visual quality tasks, jointly optimizing score regression and degradation perception to achieve 92.77% classification accuracy
Transition from binary outcome rewards to structured process-level supervision, enabling models to learn from intermediate reasoning quality rather than just final answer correctness.
- MM-PRM (2025) scaled process reward models via MCTS-based automated labeling over 700K step annotations, achieving +10.10% on out-of-distribution OlympiadBench
- GUI-Critic-R1 (Look Before You Leap, 2025) introduced pre-operative action critique with S-GRPO, preventing dangerous GUI automation errors before execution with 91.0 Exact Match score
- GUI-G2 (GUI-G2, 2025) replaced binary grounding rewards with Gaussian spatial distributions, enabling a 7B model to surpass UI-TARS-72B by 24.7 points
- Simple o3 (Simple o3, 2025) reproduced the 'thinking with images' paradigm with observe-reason-act loops integrating dynamic visual tools, surpassing GPT-4o by 27 points on MME reasoning
- DeepEyesV2 (DeepEyesV2, 2025) unified code execution and web search in a single agentic reasoning loop via cold-start SFT followed by outcome-driven RL
Shift from passive single-pass inference to active agentic reasoning where models invoke tools, critique their own actions, and manipulate visual inputs iteratively.
- UniT (2026) demonstrated that test-time compute scaling transfers to multimodal generation, achieving +225% improvement on multi-turn editing at 2.5x lower cost than parallel sampling
- M3-ACE (2026) decoupled perception from reasoning using multiple heterogeneous agents, establishing 89.1% SOTA on MathVision competition-level problems
- HouseMind (Tokenization Allows MLLMs to Understand,..., 2026) unified spatial understanding and generation through room-instance tokenization, reducing FID from 11.3 to 1.9 on layout generation
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Process Reward Supervision | Monte Carlo Tree Search (MCTS) generates step-level correctness labels to train Process Reward Models (PRMs) that detect reasoning errors at each intermediate step. | Improves on GPT-4o by +2.7% average across 6 multimodal math benchmarks, with URSA-8B achieving +20.6% absolute gain on MathVista-GPS; MM-PRM adds +10.10% accuracy on out-of-distribution OlympiadBench | Unlocking Multimodal Mathematical Reasoning via... (2025), MM-PRM (2025) |
| Shaped Reward Reinforcement Learning | Continuous reward functions (Gaussian spatial distributions, soft multi-choice rewards, mixed preference objectives) replace sparse binary signals in multimodal reinforcement learning. | Improves on UI-TARS-72B by +24.7 percentage points on ScreenSpot-Pro, with GUI-G2 achieving 47.5% accuracy using a 7B-parameter model against a 72B baseline | Enhancing the Reasoning Ability of... (2024), GUI-G2 (2025), Reinforcing Video Reasoning with Focused... (2025), Q-Insight (2025) |
| Efficient Chain-of-Thought Compression | Reasoning steps are compressed into hidden 'thinking token' representations or pruned by suppressing reflection keywords, drastically reducing generation without sacrificing accuracy. | Heima reduces generated tokens to 6% of standard CoT volume while maintaining comparable zero-shot accuracy; NoWait reduces trajectory length by 27-51% with +4.25% accuracy on AMC 2023 | Efficient Reasoning with Hidden Thinking (2025), Wait, We Don't Need to... (2025) |
| Agentic Tool-Augmented Reasoning | A two-stage training pipeline (cold-start supervised fine-tuning followed by outcome-driven RL) teaches models when and how to invoke tools and coordinate with other agents during reasoning. | Improves on Qwen3.5 by +10.2 percentage points on MathVision, with M3-ACE achieving 89.1% state-of-the-art accuracy via multi-agentic perception correction | Look Before You Leap: A... (2025), Simple o3 (2025), DeepEyesV2 (2025), M3-ACE (2026) |
| Unified Multimodal Test-Time Scaling | Budget forcing at inference compels the model to continue iterative verify-refine loops, generalizing from short training chains to arbitrarily longer inference chains. | Improves over single-pass baselines by +53.33% on MIRA out-of-distribution visual reasoning and +225.19% on ImgEdit multi-turn editing, matching best-of-N sampling at 2.5x lower cost | UniT (2026) |
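
The process reward row describes scoring intermediate reasoning steps. One common way to obtain such labels, sketched below under simplifying assumptions, is Monte Carlo estimation: roll out several completions from each step prefix and use the fraction that reach the correct answer as that step's score. `rollout_is_correct` is a hypothetical stand-in for sampling a completion from a model and checking its final answer; the MCTS-based pipelines cited above are more elaborate than this.

```python
from typing import Callable, List, Sequence


def mc_step_scores(
    steps: Sequence[str],
    rollout_is_correct: Callable[[Sequence[str]], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate, for each prefix steps[:k], the fraction of rollouts that
    end in a correct final answer. A sharp drop marks a likely bad step."""
    scores = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        wins = sum(1 for _ in range(num_rollouts) if rollout_is_correct(prefix))
        scores.append(wins / num_rollouts)
    return scores


def first_error_step(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Index of the first step whose score falls below the threshold, or -1."""
    for i, s in enumerate(scores):
        if s < threshold:
            return i
    return -1


# Deterministic toy rollout: any prefix containing the bad step is doomed.
def toy_rollout(prefix: Sequence[str]) -> bool:
    return not any("half angle = 25" in step for step in prefix)


solution = ["read angle = 40", "half angle = 25", "answer = 50"]
scores = mc_step_scores(solution, toy_rollout, num_rollouts=4)
bad_step = first_error_step(scores)
```

Step scores of this kind can then serve as training targets for a process reward model, which gives denser supervision than a single final-answer label.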
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MathVista | Accuracy (%) | 67.0% | Enhancing the Reasoning Ability of... (2024) |
| ScreenSpot-Pro | Accuracy (%) | 47.5% | GUI-G2 (2025) |
| MathVision | Accuracy (%) | 89.1% | M3-ACE (2026) |
| MIRA | Accuracy (% relative improvement) | +53.33% improvement over single-pass baseline | UniT (2026) |
| CLEVRER | Accuracy (%) | 50.4% | Reinforcing Video Reasoning with Focused... (2025) |
Known Limitations (4)
- Visual perception remains the primary bottleneck: models misinterpret diagrams in over 50% of error cases, yet most methods assume reasonably accurate initial perception (affects: Process Reward Supervision, Shaped Reward Reinforcement Learning, Unified Multimodal Test-Time Scaling)
Potential fix: Multi-agentic perception correction (M3-ACE) and iterative visual tool use (Simple o3) decouple perception from reasoning to mitigate this bottleneck
- Computational overhead from multi-step reasoning, tool invocation, and iterative refinement significantly increases inference latency, making real-time applications challenging (affects: Agentic Tool-Augmented Reasoning, Unified Multimodal Test-Time Scaling, Process Reward Supervision)
Potential fix: Latent-space reasoning (Heima, which cuts generated tokens to 6% of standard CoT) and keyword suppression (NoWait, which shortens trajectories by 27-51%) partially offset the overhead of complex reasoning pipelines
- Reward hacking and length bias in RL-trained models can produce degenerate reasoning patterns that game reward signals without improving genuine understanding (affects: Shaped Reward Reinforcement Learning, Process Reward Supervision)
Potential fix: PS-GRPO uses 'drop-moment' detection to penalize correct outcomes achieved through flawed reasoning; TW-GRPO applies entropy-based token weighting to focus learning on informative tokens
- Multimodal RAG systems are vulnerable to knowledge poisoning attacks: a single adversarial image can reduce accuracy to 0% across all queries via globalized poisoning (affects: Agentic Tool-Augmented Reasoning)
Potential fix: Robust retrieval mechanisms and adversarial filtering of knowledge base entries are needed, though comprehensive defenses against multimodal poisoning remain an open research problem
Major papers in this topic (10)
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling (2026-02) 9
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents (2025-08) 9
- M3-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering (2026-03) 8
- GUI-G2: Gaussian Reward Modeling for GUI Grounding (2025-07) 8
- Unlocking Multimodal Mathematical Reasoning via Process Reward Model (2025-01) 8
- MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision (2025-05) 8
- Simple o3: Towards Interleaved Vision-Language Reasoning (2025-08) 8
- Efficient Reasoning with Hidden Thinking (2025-01) 8
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (2024-11) 8
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification (2024-04) 8
Diving deeper into Multimodal Reasoning, let's examine specific research threads that define this area.
Visual and Spatial Reasoning
What: Research on enabling multimodal models to perform multi-step logical inference over visual inputs, integrating perception with reasoning for complex visual problem-solving.
Why: Bridging the gap between human-like visual cognition and current models is essential for trustworthy AI in real-world physical and scientific domains.
Baseline: Standard multimodal LLMs encode images once as static features and generate text answers via pattern matching without intermediate reasoning steps.
- Models rely on language shortcuts rather than genuine visual understanding, producing correct answers from hallucinated reasoning
- Reinforcement learning for multimodal reasoning suffers from reward sparsity, entropy collapse, and gradient vanishing on hard problems
- Spatial and 3D reasoning remains fundamentally weak, with top models barely exceeding random chance on abstract visual logic tasks
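The reward-sparsity and gradient-vanishing challenge listed above has a simple arithmetic core. GRPO-style methods normalize each sampled response's reward against the other responses for the same prompt, so when every response in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal. A minimal sketch follows, assuming the commonly described mean and standard-deviation normalization; the `eps` constant and the function name are illustrative.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])     # informative group
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard prompt
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # easy prompt
```

Only the mixed group yields nonzero advantages. Hard prompts where every sample fails (and easy prompts where every sample succeeds) produce all-zero advantages, which is the situation that sampling strategies favoring mixed-outcome prompts aim to avoid.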
Running Example
Baseline: A standard MLLM would encode the image once and attempt to answer directly in text, likely hallucinating distances or confusing depth relationships because it lacks spatial grounding and intermediate reasoning steps.
Challenge: This example requires (1) genuine visual perception of depth and spatial layout, not just object recognition, (2) multi-step reasoning combining relative positions, scale estimation, and 3D understanding, and (3) grounding the answer in specific image regions rather than guessing from language priors.
Overall Progress
The field evolved from modular pipelines composing LLMs with vision experts (2023) through a massive RL revolution driven by GRPO and its variants (2025), to sophisticated methods addressing spatial intelligence and agentic tool use (2026). A key paradigm shift was recognizing that supervised fine-tuning primarily teaches format while RL teaches transferable reasoning. Process reward models and visual chain-of-thought have emerged as complementary advances, enabling both training-time and test-time scaling for multimodal reasoning.
Sub-topics
RL-Based Multimodal Reasoning Optimization
38 papers
Applying reinforcement learning (primarily GRPO and its variants) to enhance multimodal models' reasoning capabilities through verifiable rewards, addressing gradient vanishing, reward sparsity, and training instability.
Visual Chain-of-Thought Methods
14 papers
Extending chain-of-thought reasoning beyond text into the visual domain by interleaving generated images, latent visual tokens, or auxiliary diagrams as intermediate reasoning steps.
Spatial and 3D Visual Reasoning
12 papers
Enabling models to understand and reason about 3D spatial relationships, object orientations, distances, and dynamic spatial interactions from visual inputs including images and videos.
Tool-Augmented Visual Reasoning
10 papers
Enhancing multimodal models with external vision tools (detectors, depth estimators, code interpreters) and training them via RL to adaptively select and compose tools for complex visual tasks.
Benchmarks and Evaluation Frameworks
11 papers
Datasets and evaluation protocols that measure genuine visual reasoning capabilities, exposing gaps between model performance and human cognition across abstract reasoning, spatial understanding, and multimodal integration.
Key Insights
- RL generalizes while SFT memorizes: reinforcement learning teaches transferable visual reasoning principles
- Text-only cold start surprisingly outperforms multimodal data for initializing visual reasoning capabilities
- Best models achieve near-random accuracy on abstract visual logic, revealing a massive human-AI gap
Timeline
Research rapidly converged on RLVR as the dominant paradigm after DeepSeek-R1, with 2025 seeing an explosion of GRPO variants addressing multimodal-specific challenges (reward sparsity, text bias, entropy collapse), while 2026 shifts toward spatial intelligence, embodied reasoning, and adaptive tool orchestration.
- MM-REACT (2023) pioneered composing ChatGPT with vision experts via textual prompts
- LLaVA (Visual Instruction Tuning, 2023) established the visual instruction tuning paradigm, connecting CLIP to Vicuna via simple linear projection
- T-SciQ (2023) achieved 96.18% on ScienceQA using mixed LLM-generated CoT signals, surpassing human performance
- GoT (2023) modeled non-linear reasoning as graph structures rather than sequential chains
- GPT-4V exploration (The Dawn of LMMs, 2023) systematically documented LMM capabilities including visual referring prompting
- VisualSketchpad (2024) enabled models to draw on images as visual reasoning steps, setting SOTA on V*Bench
Transition from task-specific vision models to general-purpose multimodal LLMs that combine visual perception with language reasoning via instruction tuning.
- Kimi k1.5 (Kimi k1.5, 2025) matched OpenAI o1 using long-context RL with partial rollouts, without Monte Carlo Tree Search
- Vision-R1 (Vision-R1, 2025) introduced modality bridging and progressive thinking suppression for stable multimodal RL training
- VisualPRM (2025) built the first large-scale multimodal process reward model enabling fine-grained step-level supervision
- Cold Start study (Advancing Multimodal Reasoning via RL..., 2025) demonstrated that SFT initialization is critical, achieving 73.4% MathVista surpassing GPT-4o
- MVoT (Imagine while Reasoning in Space, 2025) introduced visual token generation during reasoning, outperforming text CoT by +20% on spatial tasks
- EMMA benchmark (Can MLLMs Reason in Multimodality?, 2025) exposed that most 'multimodal' questions can be solved without images, filtering to truly visual tasks
Shift from supervised fine-tuning to reinforcement learning with verifiable rewards (RLVR) as the dominant training paradigm for multimodal reasoning, catalyzed by DeepSeek-R1's success.
- MMR1 (MMR1, 2025) solved GRPO gradient vanishing via Variance-Aware Sampling, achieving SOTA 58.4 across multimodal reasoning benchmarks
- World2Mind (2026) introduced training-free allocentric spatial reasoning with +17.6% on VSI-Bench
- AdaReasoner (2026) achieved 97.6% on spatial planning via RL-trained adaptive tool orchestration
- Anchor-Token (2026) identified that only ~15% of tokens are visually grounded perceptual anchors, enabling targeted reward allocation
- Compositional Visual Reasoning Survey (Explain Before You Answer, 2025) synthesized 260+ papers into a five-stage evolutionary roadmap for visual reasoning
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| RLVR-Enhanced Multimodal Reasoning | Group Relative Policy Optimization (GRPO) uses group-based reward normalization to optimize reasoning without critic models, extended with multimodal-specific stabilization techniques. | MMR1-7B achieves 58.4 avg across 5 benchmarks, surpassing R1-VL-7B (47.7) by +10.7 points; Vision-R1-7B reaches 73.5% on MathVista, near OpenAI o1's 73.9% | Kimi k1.5 (2025), MMR1 (2025), Advancing Multimodal Reasoning via Reinforcement... (2025), GPG (2025), Stable and Efficient Single-Rollout RL... (2025) |
| Visual Chain-of-Thought Reasoning | Models generate interleaved visual artifacts (images, crops, latent embeddings) as intermediate reasoning steps, bridging the semantic gap between perception and language. | MVoT outperforms text CoT by +20% on complex spatial tasks (FrozenLake 85.6% vs CoT 39.1%); MINT-CoT-7B improves +34.08% on MathVista over baseline | Imagine while Reasoning in Space:... (2025), MINT-CoT (2025), Monet (2025), MathCanvas (2025) |
| Multimodal Process Reward Models | Process Reward Models (PRMs) score individual reasoning steps via Monte Carlo estimation or consistency filtering, providing dense supervision beyond binary final-answer rewards. | VisualPRM-8B improves InternVL2.5-78B by +5.9 points across 7 benchmarks; Athena-PRM achieves 83.1 F1 on VisualProcessBench, outperforming prior best by +3.9; DreamPRM reaches 85.2% on MathVista leaderboard | VisualPRM (2025), Athena (2025), DreamPRM (2025), AutoRubric-R1V (2025) |
| Tool-Augmented Visual Reasoning | Models learn when and how to invoke external vision tools through reinforcement learning, treating tool selection as a trainable reasoning skill rather than static supervised behavior. | ReVPT-7B improves +9.82% on CV-Bench over Qwen2.5-VL-7B; AdaReasoner-7B surpasses GPT-5 on spatial planning (96.6% vs 80.1%); VisualSketchpad boosts GPT-4o by +12.7% on math tasks | MM-REACT (2023), VisualSketchpad (2024), Reinforced Visual Perception with Tools (2025), AdaReasoner (2026) |
| Allocentric Spatial and Embodied Reasoning | Converting egocentric visual observations into global allocentric representations (spatial trees, semantic orientations, grid maps) enables reasoning about absolute positions and 3D relationships. | World2Mind improves Claude-4.6-Opus by +17.6 points on VSI-Bench (38.4% to 56.0%); SpaceR achieves 45.6% on VSI-Bench, surpassing GPT-4o by +11.6%; vsGRPO-2B outperforms GPT-4o on visual-spatial tasks | World2Mind (2026), M2-Reasoning (2025), SoFar (2025), Embodied-Reasoner (2025) |
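
The allocentric row refers to converting viewer-relative observations into a shared world frame. The geometric core of that step is a rigid transform, sketched here in 2D. This is textbook geometry under an assumed convention (camera heading measured counterclockwise from the world x-axis, egocentric coordinates given as forward and left offsets), not the pipeline of any method in the table.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def ego_to_allocentric(p_ego: Point, cam_pos: Point, cam_heading_rad: float) -> Point:
    """Rotate an egocentric point by the camera heading, then translate by
    the camera position to get world (allocentric) coordinates."""
    fwd, left = p_ego
    c, s = math.cos(cam_heading_rad), math.sin(cam_heading_rad)
    x = cam_pos[0] + c * fwd - s * left
    y = cam_pos[1] + s * fwd + c * left
    return (x, y)


# Camera at (2, 1) facing the world +y direction (heading = 90 degrees).
# An object 3 units straight ahead should land near (2, 4) in the world frame.
obj_world = ego_to_allocentric((3.0, 0.0), (2.0, 1.0), math.pi / 2)
```

Once several views are mapped into the same frame, questions about absolute positions and relative distances reduce to comparisons between world coordinates rather than per-image guesses.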
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MathVista | Accuracy (%) | 85.2% | DreamPRM (2025) |
| VSI-Bench | Average Accuracy (%) | 56.0% | World2Mind (2026) |
| ScienceQA | Accuracy (%) | 96.18% | T-SciQ (2023) |
| CV-Bench (Spatial Reasoning) | Average Accuracy (%) | 82.3% | M2-Reasoning (2025) |
| VisuLogic | Accuracy (%) | 31.1% | VisuLogic (2025) |
Known Limitations (4)
- Text-bias and language shortcuts: models frequently arrive at correct answers by exploiting textual patterns rather than processing visual information, producing 'right answers for wrong reasons' (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning)
Potential fix: Text-bias calibration by subtracting text-only predictions from multimodal predictions; visual perception rewards that verify grounding; answer-grounding consistency metrics
- Entropy collapse and reward sparsity: GRPO-based training frequently leads to premature convergence where models stop exploring, especially on hard problems where all sampled responses fail (affects: RLVR-Enhanced Multimodal Reasoning)
Potential fix: Variance-aware sampling to select prompts with mixed outcomes; latent spectral dispersion regularization; hint-guided training that provides partial solutions to unlock gradient signals on hard problems
- Fundamental visual perception failures: 72-78% of reasoning errors stem from incorrect visual perception rather than flawed logic, and models perform worse on images than equivalent text descriptions (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning, Allocentric Spatial and Embodied Reasoning)
Potential fix: Visual-text self-distillation to close the modality gap; dedicated visual perception reward signals during RL training; tool augmentation to offload fine-grained perception to specialist models
- Scalability to small models: most advances target 7B+ parameter models, while compact models (<4B) struggle with complex multimodal reasoning and are under-explored (affects: RLVR-Enhanced Multimodal Reasoning, Visual Chain-of-Thought Reasoning)
Potential fix: Two-stage text-first RL bootstrapping before multimodal transfer; relaxed on-policy distillation from larger teachers; no-thinking and adaptive-thinking strategies that reduce computational overhead for simple tasks
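
The reward-sparsity problem noted for GRPO-based training can be made concrete with a small sketch. It assumes binary rewards and the mean/standard-deviation normalization commonly described for group-relative advantages. `select_mixed_prompts` is a hypothetical helper that illustrates the idea of variance-aware sampling and is not taken from any cited paper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_mixed_prompts(reward_groups: Dict[str, List[float]]) -> List[str]:
    """Keep only prompts whose sampled responses disagree (non-zero reward variance)."""
    return [prompt for prompt, rs in reward_groups.items() if pstdev(rs) > 0.0]


groups = {
    "hard": [0.0, 0.0, 0.0, 0.0],   # every sampled response fails
    "mixed": [1.0, 0.0, 0.0, 1.0],  # some succeed, some fail
    "easy": [1.0, 1.0, 1.0, 1.0],   # every sampled response succeeds
}
hard_adv = group_advantages(groups["hard"])
mixed_adv = group_advantages(groups["mixed"])
kept = select_mixed_prompts(groups)
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is why filtering for mixed outcomes (or adding hints to hard problems) restores a learning signal.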
View major papers in this topic (10)
- Visual Instruction Tuning (2023-04) 9
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023-09) 9
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025-01) 9
- T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering (2023-05) 9
- World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models (2026-03) 9
- Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers (2025-06) 9
- Explain Before You Answer: A Survey on Compositional Visual Reasoning (2025-08) 9
- QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training (2025-05) 9
- R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation (2025-05) 9
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought (2025-01) 8
Within the same paradigm, another important research direction focuses on Hallucination Mitigation.
Hallucination Mitigation
What: Research on detecting, measuring, and reducing hallucinations in Multimodal Large Language Models (MLLMs), that is, outputs that contradict visual evidence or factual knowledge.
Why: Hallucinations undermine MLLM reliability in safety-critical applications like medical imaging, autonomous navigation, and visual assistants.
Baseline: Standard MLLMs generate text auto-regressively from visual features, relying on language priors without explicit grounding or verification mechanisms.
- Models over-rely on language priors and statistical co-occurrence rather than grounding responses in actual visual content
- Preference optimization often overfits to easy examples while failing on nuanced hallucination cases
- No unified benchmark covers all hallucination types across faithfulness, factuality, and reasoning dimensions
Running Example
Baseline: A standard MLLM might respond 'There are three dogs near a blue bench in a park with a fountain,' hallucinating an extra dog, the wrong color, and a non-existent fountain due to language priors about typical park scenes.
Challenge: This illustrates object hallucination (inventing a third dog), attribute hallucination (wrong bench color), and extrinsic hallucination (fabricating a fountain), showing how models rely on statistical co-occurrence patterns rather than visual grounding.
Overall Progress
The field has progressed from surface-level output correction (2023) through self-improvement and data-centric methods (2024) to process-aware evaluation and cross-modal robustness (2025-2026). A major paradigm shift occurred with the realization that correct final answers often mask severe reasoning hallucinations, which shifted evaluation focus from outcomes to intermediate thinking. Concurrently, methods evolved from requiring costly human feedback to fully self-supervised approaches using self-generated preference pairs and training-free inference interventions.
Sub-topics
Preference-Based Alignment
7 papers
Methods that use human or automated feedback with preference optimization (RLHF, DPO, GRPO) to align MLLM outputs with visual ground truth and reduce hallucinations through reward shaping.
Decoding & Inference-Time Mitigation
4 papers
Training-free methods that modify the decoding process or apply inference-time interventions (attention penalties, steering vectors, tool-based verification) to suppress hallucinations without retraining.
Grounded Reasoning & Chain-of-Thought
5 papers
Approaches that incorporate explicit visual grounding (bounding boxes, spatial coordinates) into chain-of-thought reasoning to ensure models justify answers with verifiable visual evidence.
Benchmarks & Evaluation Frameworks
7 papers
Diagnostic benchmarks and evaluation methodologies that systematically measure different hallucination types (existence, attribute, relation, faithfulness, factuality) across diverse tasks and contexts.
Data Curation & Instruction Tuning
5 papers
Methods that improve hallucination robustness through better training data design, including negative instruction examples, diverse high-quality datasets, targeted unlearning, and parameter-efficient strategies.
Domain-Specific Applications
4 papers
Hallucination mitigation techniques tailored to specific domains such as medical report generation, agriculture, and energy forecasting, where factual accuracy is critical.
Surveys & Theoretical Frameworks
4 papers
Comprehensive surveys of AGI hallucination across modalities and theoretical frameworks that formalize hallucination measurement using information geometry and cognitive science.
Key Insights
- Segment-level human corrections reduce hallucinations 7× more efficiently than whole-response ranking
- Training-free attention penalties and steering vectors reduce hallucinations without model retraining
- Correct final answers frequently mask severe hallucinations in intermediate reasoning steps
- Larger models paradoxically ground worse: 72B models show lower consistency than 7B counterparts
Timeline
Research has moved from reactive post-hoc correction toward proactive grounding (embedding spatial verification directly into reasoning chains) while simultaneously expanding evaluation from simple object existence checks to multi-dimensional, process-aware, cross-modal benchmarks.
- LRV-Instruction (Mitigating Hallucination in Large Multi-Modal..., 2023) introduced the first large-scale negative instruction dataset with 400k examples covering 16 tasks
- Fact-RLHF (Aligning Large Multimodal Models with..., 2023) pioneered factually augmented reward models with 'cheat sheets' and created MMHal-Bench
- OPERA (2023) discovered columnar attention patterns as a root cause of hallucination and introduced a training-free decoding intervention
- AMBER (2023) established LLM-free, reproducible hallucination evaluation across generative and discriminative tasks
- RLHF-V (2023) demonstrated that segment-level DPO reduces hallucinations by 34.8% with 7× less data than response-level RLHF
- Volcano (2023) introduced single-model critique-revise-decide loops for self-correction
Shift from treating hallucination as a secondary failure mode to a primary research target, with dedicated datasets, benchmarks, and alignment methods.
- VHTest (Visual Hallucinations of Multi-modal Large..., 2024) used CLIP/DINO discrepancy to adversarially generate diverse hallucination instances across 8 modes
- Cantor (2024) replaced fragmented external tools with MLLM-as-experts via prompted role-playing
- SIMA (Enhancing Visual-Language Modality Alignment via Self-Improvement, 2024) demonstrated hallucination reduction without any external models using self-generated preference pairs
- MMInstruct (2024) built a semi-automatic data engine achieving SOTA on 10 of 12 benchmarks
- Pelican (2024) introduced computational graph verification reducing hallucinations by 27% over Woodpecker
- LLM-RG4 (2024) applied adaptive token fusion and loss weighting to eliminate input-agnostic hallucinations in medical reports
- GCoT (Grounded Chain-of-Thought for MLLMs, 2025) revealed an inverse scaling phenomenon where larger models ground worse, with 72B models showing only 11.1% consistency despite 75.7% accuracy
- Rex-Thinker (2025) combined structured planning-action-summarization CoT with GRPO reinforcement for 86.8% rejection accuracy
- FlexAC (2025) discovered middle-layer steering vectors enabling dynamic faithfulness-creativity control at inference time
- MM-THEBench (2026) exposed that top models achieve 70.6% answer accuracy but only 22.8% thinking correctness
- MoD-DPO (Modality-Decoupled Direct Preference Optimization, 2026) introduced modality-aware invariance and sensitivity regularization achieving +27% on cross-modal hallucination tasks
- SARE (Beyond Superficial Unlearning, 2026) formulated unlearning as a min-max game, making hallucination erasure robust against fine-tuning perturbations
- INFACT (2026) revealed that most video-LLMs have near-zero temporal sensitivity, relying on static cues rather than temporal understanding
Shift from output-level hallucination detection to process-level reasoning evaluation, revealing that correct answers often mask hallucinated intermediate reasoning steps.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Preference-Based Alignment | Align model outputs with visual ground truth via Direct Preference Optimization (DPO) or Reinforcement Learning from Human Feedback (RLHF) on carefully constructed preference pairs. | RLHF-V reduces hallucination rate by 34.8% over the base MLLM using only 1.4k samples, outperforming LLaVA-RLHF which required 10k samples. MoD-DPO achieves +27% accuracy on AVHBench over vanilla DPO. | RLHF-V (2023), Aligning Large Multimodal Models with... (2023), Modality-Decoupled (2026), DA-DPO (2026), Enhancing Visual-Language Modality Alignment in... (2024) |
| Training-Free Decoding Intervention | Detect and intervene on hallucination-prone attention or activation patterns during decoding, requiring zero additional training or data. | OPERA achieves up to +35.8% improvement on the CHAIR hallucination metric over standard beam search across InstructBLIP, MiniGPT-4, LLaVA, and Shikra. FlexAC reduces CHAIR hallucination by 29% while boosting creativity 5.8× on Creation-MMBench. | OPERA (2023), FlexAC (2025), Pelican (2024) |
| Grounded Chain-of-Thought Reasoning | Embed explicit spatial grounding (bounding boxes, visual prompts) into chain-of-thought reasoning to ensure each reasoning step is tied to verifiable visual evidence. | GCoT achieves +55.7% improvement in Answer-Grounding Consistency over baseline LLaVA-7B (from ~11% to ~67%). Volcano achieves +24.9% on hallucination benchmarks over prior methods like Woodpecker and LLaVA-RLHF. | Grounded Chain-of-Thought for Multimodal Large... (2025), Volcano (2023), Rex-Thinker (2025), Cantor (2024) |
| Robust Data Curation & Unlearning | Curate balanced training data with explicit negative examples and diverse instructions, or erase hallucination patterns through robust unlearning that survives fine-tuning perturbations. | LRV-Instruction improves POPE accuracy by +28.2 points (from 56.8 to 85.0) on MiniGPT-4. MMInstruct achieves 1626.2 on MME, surpassing LLaVA-1.5 baseline by +94.9 points. SARE reduces CHAIR_S from 69.6 to 37.3, outperforming EFUF baseline (43.6). | Mitigating Hallucination in Large Multi-Modal... (2023), MMInstruct (2024), Beyond Superficial Unlearning (2026), An Empirical Study on Parameter-Efficient... (2024) |
| Comprehensive Hallucination Evaluation | Construct diverse, multi-dimensional hallucination benchmarks with fine-grained taxonomies and reproducible automated metrics to replace costly human or GPT-4 evaluation. | VHTest reveals GPT-4V achieves only 38.3% accuracy on adversarially generated hallucination instances. MM-THEBench shows Qwen3-VL-235B has 70.6% answer accuracy but only 22.8% thinking correctness, exposing hidden reasoning hallucinations. | AMBER (2023), Visual Hallucinations of Multi-modal Large... (2024), MM-THEBench (2026), INFACT (2026), LongHalQA (2024) |
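
For readers unfamiliar with the preference objective behind the DPO-based methods in the table, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. In the multimodal setting the conditioning context includes the image. The function name, argument names, and the `beta` default are illustrative assumptions, and none of the cited papers' variants (segment-level, modality-decoupled) are reproduced here.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # policy vs. reference on the preferred answer
    rejected_ratio = logp_rejected - ref_logp_rejected  # policy vs. reference on the hallucinated answer
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy already prefers the grounded answer more than the reference does.
good = dpo_loss(-1.0, -3.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
# The policy prefers the hallucinated answer more than the reference does.
bad = dpo_loss(-3.0, -1.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

The loss is smaller when the policy raises the likelihood of the preferred response relative to the reference model, so minimizing it pushes probability mass away from the rejected (for example, hallucinated) response.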
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| POPE (Polling-based Object Probing Evaluation) | Accuracy | 85.0% accuracy (Random split) | Mitigating Hallucination in Large Multi-Modal... (2023) |
| CHAIR (Caption Hallucination Assessment with Image Relevance) | CHAIR_S (sentence-level hallucination rate, lower is better) | 37.3 CHAIR_S on mPLUG-Owl (reduced from 69.6 vanilla) | Beyond Superficial Unlearning (2026) |
| MMHal-Bench | Overall Score | 60% relative improvement over baselines | Aligning Large Multimodal Models with... (2023) |
| Answer-Grounding Consistency (GCoT Metric) | Consistency Rate | +55.7% improvement over baseline LLaVA-7B | Grounded Chain-of-Thought for Multimodal Large... (2025) |
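
To make the CHAIR columns easier to interpret, here is a minimal sketch of the two CHAIR rates, assuming object mentions have already been extracted from each caption and mapped to a fixed vocabulary (the real metric relies on an annotated object list and synonym matching, which is omitted here). The function name and the toy data are illustrative.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned: List[Set[str]], ground_truth: List[Set[str]]
) -> Tuple[float, float]:
    """Return (CHAIR_I, CHAIR_S) for captions given their mentioned and true objects."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for objs, truth in zip(mentioned, ground_truth):
        fake = objs - truth  # objects named in the caption but absent from the image
        total_mentions += len(objs)
        hallucinated_mentions += len(fake)
        captions_with_hallucination += bool(fake)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s


# Toy data echoing the running example: the first caption invents a fountain.
mentioned = [{"dog", "bench", "fountain"}, {"cat", "sofa"}]
truth = [{"dog", "bench"}, {"cat", "sofa"}]
chair_i, chair_s = chair_scores(mentioned, truth)
```

CHAIR_I is the share of mentioned objects that are hallucinated, while CHAIR_S is the share of captions containing at least one hallucinated object, so lower is better for both.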
Known Limitations (4)
- Most methods are validated only on 7B-13B parameter models, leaving uncertainty about whether findings scale to frontier-class MLLMs (70B+) or exhibit inverse scaling effects (affects: Preference-Based Alignment, Robust Data Curation & Unlearning, Grounded Chain-of-Thought Reasoning)
Potential fix: GCoT's inverse scaling finding suggests that grounding-aware training specifically needs to be applied at scale; larger models may require proportionally more grounding data or stronger architectural constraints.
- Evaluation is fragmented across incompatible benchmarks (POPE, CHAIR, MMHal-Bench, AMBER) with different metrics, making cross-method comparison unreliable and potentially misleading (affects: Comprehensive Hallucination Evaluation, Preference-Based Alignment, Training-Free Decoding Intervention)
Potential fix: Unified multi-dimensional benchmarks like AMBER and MM-THEBench are moving toward standardization, but the community needs consensus on a common evaluation protocol covering faithfulness, factuality, and reasoning dimensions.
- Methods that reduce hallucinations often suppress creative and associative reasoning, creating a faithfulness-creativity trade-off that limits MLLM applicability in open-ended tasks (affects: Preference-Based Alignment, Training-Free Decoding Intervention)
Potential fix: FlexAC demonstrates that steering vectors can dynamically adjust the faithfulness-creativity balance at inference time, suggesting adaptive control mechanisms as a promising direction.
- Standard DPO-based unlearning achieves only superficial suppression: hallucinations catastrophically resurge after lightweight fine-tuning or parameter perturbation (affects: Preference-Based Alignment, Robust Data Curation & Unlearning)
Potential fix: SARE's min-max formulation with sharpness-aware optimization demonstrates that flattening the loss landscape around unlearned states can make erasure robust to fine-tuning perturbations.
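
The idea of adjusting the faithfulness-creativity balance with a steering vector at inference time can be illustrated with a difference-of-means construction. This is a generic sketch and an assumption on our part: the summary above does not describe how FlexAC builds or applies its vectors, and all names and the toy activations below are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(faithful: List[Vector], hallucinated: List[Vector]) -> Vector:
    """Direction pointing from hallucinated activations toward faithful ones."""
    mu_f = mean_vector(faithful)
    mu_h = mean_vector(hallucinated)
    return [f - h for f, h in zip(mu_f, mu_h)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction; alpha sets strength and sign."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D activations collected at one (hypothetical) middle layer.
direction = steering_vector(
    faithful=[[1.0, 0.0], [3.0, 0.0]],
    hallucinated=[[0.0, 1.0], [0.0, 3.0]],
)
more_faithful = steer([0.5, 0.5], direction, alpha=1.0)
more_associative = steer([0.5, 0.5], direction, alpha=-1.0)
```

Because `alpha` is chosen per request, the same frozen model can be nudged toward grounded answers for factual queries and toward looser association for creative ones, without retraining.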
View major papers in this topic (10)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback (2023-12) 8
- Aligning Large Multimodal Models with Factually Augmented RLHF (2023-09) 8
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation (2023-11) 8
- Visual Hallucinations of Multi-modal Large Language Models (2024-02) 8
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024-07) 8
- Modality-Decoupled Direct Preference Optimization (2026-03) 8
- Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs (2026-01) 8
- FlexAC: Flexible Association Control for Multimodal Large Language Models (2025-10) 8
- Grounded Chain-of-Thought for Multimodal Large Language Models (2025-03) 7
- MM-THEBench: Do Reasoning MLLMs Think Reasonably? (2026-01) 8
Within the same paradigm, another important research direction focuses on Multimodal RLHF and Preference Alignment.
Multimodal RLHF and Preference Alignment
What: Research on aligning multimodal large language models with human preferences through reward modeling, RLHF, and direct preference optimization across vision-language tasks.
Why: Multimodal models frequently hallucinate, ignore visual details, or produce outputs misaligned with human expectations, undermining trustworthiness and usability.
Baseline: Standard supervised fine-tuning on curated vision-language datasets with single-step, scalar reward signals from human annotations.
- Step-level supervision for multimodal reasoning is expensive and difficult to automate reliably
- Reward models struggle to generalize across diverse modalities and complex reasoning traces
- Cross-modal preference alignment must jointly optimize over visual, textual, and contextual dimensions
Running Example
Baseline: A standard SFT-trained model generates a verbose, generic caption like 'A kitchen with many items on the counter', missing specific objects and spatial details and failing to adapt to surrounding webpage context.
Challenge: This example illustrates all key challenges: (1) the model's reasoning steps (identifying objects, spatial layout) lack intermediate supervision; (2) a single reward score cannot distinguish hallucinated objects from missing details; (3) the alt-text must align across the image content, the generated text, and the surrounding page context simultaneously.
Overall Progress
The field has progressed from foundational visual-to-text alignment methods (2021) to unified any-modality frameworks (2023) to sophisticated multimodal preference optimization (2024-2025). A major paradigm shift occurred with the move from single-step, scalar rewards to step-level, multi-dimensional reward signals that verify intermediate multimodal reasoning. The latest work increasingly focuses on reward model robustness, few-shot adaptability, and cross-modal preference dimensions.
Sub-topics
Multimodal Projection and Representation Alignment
3 papers
Methods for projecting diverse modality signals (image, audio, video) into a shared language model embedding space using lightweight adapters, enabling frozen LLMs to process multimodal inputs without modifying their weights.
Multimodal Reward Modeling
3 papers
Building reward models that provide fine-grained, multi-dimensional reward signals for multimodal reasoning, including step-level supervision via executable visual programs and few-shot activation steering.
Multimodal Preference Optimization
2 papers
Adapting preference optimization methods such as DPO and RLHF to multimodal settings, incorporating cross-modal preference pairs and negative supervision for vision-language alignment.
Key Insights
- Step-level multi-dimensional rewards outperform coarse single-score supervision for multimodal reasoning
- Negative supervision from rejected responses captures the core value of multimodal RLHF
- Frozen LLM weights preserve factual knowledge better than fine-tuned alternatives in multimodal settings
- Few-shot activation steering resists reward hacking more effectively than prompting-based approaches
- Cross-modal preference pairs across visual, textual, and contextual dimensions improve alignment robustness
Timeline
Research has evolved from simply projecting visual features into LLM space toward building comprehensive reward and preference systems that capture fine-grained, multi-dimensional alignment across diverse modalities and reasoning steps.
- Frozen (Multimodal Few-Shot Learning with Frozen..., 2021) introduced visual prefix tuning that treats images as continuous words, achieving multimodal few-shot learning with a completely frozen LLM
First demonstration that visual inputs can be projected into a frozen LLM's embedding space as continuous tokens, enabling multimodal reasoning without modifying the language model.
- NExT-GPT (2023) connected frozen encoders and diffusion decoders to an LLM core for end-to-end any-to-any multimodal generation
- AnyMAL (Any-Modality Augmented Language Model, 2023) demonstrated scalable 70B-parameter multimodal alignment using quantized pre-training across five modalities
Extension from image-only alignment to unified frameworks supporting five-plus modalities (image, audio, video, motion) and bidirectional generation.
- nSFT (Continual SFT Matches Multimodal RLHF..., 2024) revealed that negative supervision is the core value of multimodal RLHF, proposing a simpler SFT-based alternative
- SVIP (Benchmarking Multimodal CoT Reward Model..., 2025) automated step-level reward annotation using executable visual programs with three-dimensional quality labels
- Skywork-VL Reward (2025) achieved state-of-the-art multimodal reward modeling through dual-source data curation and two-stage training
- Activation Reward (Activation Reward Models for Few-Shot..., 2025) introduced few-shot activation steering that surpasses GPT-4o on reward hacking resistance
- MCM-DPO (2025) extended DPO with seven-dimensional cross-modal preference pairs for alt-text generation
Shift from coarse final-answer rewards to fine-grained, step-level, multi-dimensional reward signals and cross-modal preference optimization.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Visual Prefix Tuning for Frozen LLMs | Treat images as continuous 'visual words' aligned to a frozen LLM's embedding space, preserving its reasoning and few-shot abilities. | Improves over fine-tuning baselines by +1.7% on OKVQA zero-shot (5.9% vs 4.2%), demonstrating frozen weights preserve factual knowledge better. | Multimodal Few-Shot Learning with Frozen... (2021) |
| Unified Any-to-Any Multimodal Alignment | Connect frozen pre-trained encoders and decoders to an LLM core via small projection layers, enabling any-input to any-output multimodal reasoning. | AnyMAL improves +7.0% relative accuracy on VQAv2 zero-shot and +8.4 CIDEr on COCO captioning over prior literature baselines. | NExT-GPT (2023), Any-Modality (2023) |
| Stepwise Visual Program Reward Modeling | Translate code execution traces into natural-language CoT steps with three-dimensional labels (Relevance, Logic, and Attribute) for fine-grained reward modeling. | Improves +6.3% on SVIP-Test for Qwen2-VL-7B over baseline, and +5.95% average with the SVIP-Reward architecture over standard fine-tuning. | Benchmarking Multimodal CoT Reward Model... (2025) |
| Scalable Multimodal Reward Models | Combine dual-source preference data from standard VLMs and advanced reasoners with two-stage training, or use activation steering for few-shot reward adaptation. | Skywork-VL achieves state-of-the-art on VL-RewardBench among open-source models; Activation Reward surpasses GPT-4o on the PreferenceHack benchmark for robustness. | Skywork-VL Reward (2025), Activation Reward Models for Few-Shot... (2025) |
| Cross-Modal Direct Preference Optimization | Optimize preferences across seven combinations of visual, textual, and contextual dimensions to teach models correct cross-modal alignment. | MCM-DPO consistently outperforms standard DPO and SFT baselines on TAlt and PAlt benchmarks, establishing new state-of-the-art for alt-text generation. | Continual SFT Matches Multimodal RLHF... (2024), MCM-DPO (2025) |
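
The projection idea shared by the first two rows (map encoder features into the LLM's embedding space and feed them as extra tokens) can be sketched in a few lines. This is a toy, dependency-free illustration under stated assumptions: a single linear projection, made-up dimensions, and hypothetical function names. It is not the architecture of Frozen, NExT-GPT, or AnyMAL.

```python
from typing import List

Vector = List[float]


def project_to_prefix(
    image_feature: Vector, weight: List[Vector], num_tokens: int, d_model: int
) -> List[Vector]:
    """Linearly map one image feature to num_tokens embeddings of size d_model."""
    flat = [sum(w * x for w, x in zip(row, image_feature)) for row in weight]
    if len(flat) != num_tokens * d_model:
        raise ValueError("weight must have num_tokens * d_model rows")
    return [flat[i * d_model:(i + 1) * d_model] for i in range(num_tokens)]


def build_llm_input(prefix: List[Vector], text_embeddings: List[Vector]) -> List[Vector]:
    """Prepend the visual tokens to the text embeddings; the LLM itself stays frozen."""
    return prefix + text_embeddings


# Toy sizes: a 2-D image feature becomes two visual tokens in a 2-D embedding space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # the only trainable part
prefix = project_to_prefix([1.0, 2.0], weight, num_tokens=2, d_model=2)
sequence = build_llm_input(prefix, text_embeddings=[[0.5, 0.5]])
```

Only `weight` would receive gradients in such a setup, which is what lets the language model keep its pre-trained knowledge and few-shot behavior.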
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| VL-RewardBench | Accuracy | State-of-the-art among open-source models | Skywork-VL Reward (2025) |
| SVIP-Test | Accuracy | +6.3% over baseline for Qwen2-VL-7B | Benchmarking Multimodal CoT Reward Model... (2025) |
| VQAv2 (Zero-Shot) | Accuracy | +7.0% relative accuracy over baselines | Any-Modality (2023) |
| PreferenceHack | Robustness Accuracy | Surpasses GPT-4o | Activation Reward Models for Few-Shot... (2025) |
Known Limitations (4)
- Step-level reward annotation requires executable visual programs, limiting applicability to tasks where code-based verification is feasible and excluding abstract reasoning or emotional understanding (affects: Stepwise Visual Program Reward Modeling)
Potential fix: Developing program-free step verification methods or combining code-based and neural-based verification approaches
- Multimodal reward models trained on specific VLM outputs may not generalize to new model families or novel task distributions, creating a reward model specialization gap (affects: Scalable Multimodal Reward Models, Stepwise Visual Program Reward Modeling)
Potential fix: Dual-source data curation mixing standard and advanced reasoner outputs, and few-shot activation steering for rapid adaptation to new domains
- Preference optimization across multiple modalities increases training complexity and risks catastrophic forgetting of text-only capabilities when fine-tuning on multimodal data (affects: Cross-Modal Direct Preference Optimization, Scalable Multimodal Reward Models)
Potential fix: Two-stage training that first aligns multimodal data then incorporates text-only data to prevent forgetting, as demonstrated by Skywork-VL Reward
- Most alignment methods are evaluated on English-centric benchmarks with limited assessment of cross-cultural and multilingual multimodal alignment quality (affects: Visual Prefix Tuning for Frozen LLMs, Unified Any-to-Any Multimodal Alignment, Scalable Multimodal Reward Models)
Potential fix: Creating multilingual multimodal preference datasets and culturally diverse evaluation benchmarks
View major papers in this topic (8)
- Multimodal Few-Shot Learning with Frozen Language Models (2021-12) 9
- NExT-GPT: Any-to-Any Multimodal LLM (2023-09) 8
- Any-Modality Augmented Language Model (AnyMAL) (2023-09) 8
- Continual SFT Matches Multimodal RLHF with Negative Supervision (2024-11) 7
- Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program (2025-04) 8
- Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning (2025-05) 8
- Activation Reward Models for Few-Shot Model Alignment (2025-07) 8
- MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation (2025-10) 7
Moving to the next paradigm, we turn to Architecture and Efficiency.
Architecture and Efficiency
What: Research on designing efficient multimodal model architectures, parameter-efficient adaptation methods, and reinforcement learning techniques that enable scalable deployment of vision-language-action systems.
Why: Deploying multimodal models in real-world settings demands methods that balance capability with computational cost, adaptability, and safety across diverse modalities.
Baseline: Full fine-tuning of large pre-trained models on downstream multimodal tasks, updating all parameters with supervised learning on task-specific labeled data.
- Full fine-tuning is computationally prohibitive and causes catastrophic forgetting of pre-trained knowledge across modalities
- Fusing heterogeneous modalities (vision, language, audio, depth) while handling missing or noisy inputs remains brittle
- Reinforcement learning for multimodal models suffers from training instability, sparse rewards, and poor transfer from language-centric paradigms to visual perception
Running Example
Baseline: A fully fine-tuned multimodal model would require billions of parameters updated for this task, consuming excessive memory and latency on a mobile device. It would process audio and vision independently without cross-modal reasoning, likely failing to ground the spoken reference to the correct visual region.
Challenge: This example illustrates three key challenges: (1) the model must be compact enough for on-device deployment yet capable across modalities, (2) it must fuse speech and vision to ground a spoken reference in a cluttered scene, and (3) it must segment the target object precisely, a task where supervised fine-tuning alone struggles without dense pixel-level labels.
Overall Progress
The field has undergone two major paradigm shifts: first, from full fine-tuning to parameter-efficient adaptation (2023), proving that freezing backbones and tuning <1% of parameters often surpasses full fine-tuning; second, from supervised fine-tuning to RL-based post-training (2025-2026), where GRPO variants are being systematically adapted for multimodal perception tasks. Simultaneously, compact architectures have matured from single-modality adapters to unified multi-modal models serving vision, language, and speech within a single frozen backbone.
Sub-topics
Parameter-Efficient Fine-Tuning & Adaptation
25 papers
Methods for adapting large pre-trained models to downstream multimodal tasks using minimal trainable parameters, including adapters, prompt tuning, LoRA variants, and spectral-domain techniques that preserve pre-trained knowledge while enabling efficient specialization.
Reinforcement Learning for Multimodal Models
30 papers
Techniques applying reinforcement learning, especially Group Relative Policy Optimization (GRPO) and its variants, to enhance multimodal model capabilities in reasoning, perception, segmentation, and GUI interaction, addressing instability, sparse rewards, and visual-domain transfer challenges.
Multi-Modal Fusion & Segmentation
30 papers
Approaches for integrating heterogeneous sensor modalities (RGB, depth, thermal, LiDAR, event cameras) into unified representations for segmentation, tracking, and scene understanding, often adapting foundation models like SAM to multi-modal inputs.
Efficient & Compact Architectures
20 papers
Design of compact multimodal models, hardware-aware acceleration, and compression techniques that enable deployment on resource-constrained devices while maintaining strong performance across vision, language, and speech tasks.
Multimodal Agent Architectures & Benchmarks
20 papers
Frameworks and evaluation benchmarks for autonomous multimodal agents that operate in real-world environments (including GUI automation, tool use, robotic control, and proactive assistance), measuring planning, grounding, and safety capabilities.
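
As a concrete reference point for the parameter-efficient methods listed above, the sketch below shows the standard LoRA update, y = W x + (alpha / r) B A x, together with the fraction of parameters it trains. The layer sizes and helper names are illustrative assumptions, and the pure-Python matrix code is meant for reading rather than speed.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(
    x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float, rank: int
) -> Vector:
    """Frozen base projection W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)             # W is frozen: shape d_out x d_in
    update = matvec(B, matvec(A, x))  # A: rank x d_in, B: d_out x rank (both trainable)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
# B starts at zero, so the adapted layer initially reproduces the frozen layer.
y_init = lora_forward([1.0, 2.0], W, A, B=[[0.0], [0.0]], alpha=16.0, rank=1)
y_tuned = lora_forward([1.0, 2.0], W, A, B=[[1.0], [0.0]], alpha=16.0, rank=1)
fraction = lora_param_fraction(4096, 4096, rank=8)
```

For a 4096 x 4096 projection with rank 8, `fraction` comes to roughly 0.4% of the layer's weights, which is the regime the sub-topic description refers to as minimal trainable parameters.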
Key Insights
- Frozen backbones with <1% tunable parameters frequently surpass full fine-tuning on dense visual tasks
- RL post-training paradigms designed for language reasoning fail on visual perception without domain-specific adaptations
- Modality-aware reward normalization reduces gradient variance by 10-13% and accelerates convergence 3x
- Current multimodal agents achieve less than 50% success rate on realistic tool-use and GUI benchmarks
- Compact 3-4B models with modality-specific LoRA can match performance of models twice their size
Timeline
Research has evolved from isolated efficiency techniques (adapters, pruning) toward holistic multimodal system design that combines parameter-efficient adaptation, RL-based training, modality-aware optimization, and agentic capabilities into unified frameworks that are both compact and capable.
- CHARM (2023) demonstrated 32.51x throughput gains for ViT inference by composing heterogeneous accelerators on a single chip
- LLaMA-Adapter (2023) introduced zero-initialized attention gating for efficient instruction tuning with only 1.2M parameters
- ViPT (Visual Prompt Multi-Modal Tracking, 2023) pioneered prompt-based multi-modal tracking, beating full fine-tuning with <1% trainable parameters
- PerSAM (Personalize Segment Anything Model with One-Shot, 2023) achieved training-free SAM personalization with just 2 learnable parameters
- SkySense (2023) set a new standard as a billion-scale multimodal remote sensing foundation model across 16 datasets
Shift from full fine-tuning to frozen-backbone adaptation, proving that <1% of parameters can match or exceed full fine-tuning performance.
- Mona (5%>>>100%: Breaking Performance Shackles, 2024) broke the full fine-tuning ceiling on dense prediction tasks with multi-cognitive visual adapters
- MM-SAM (Segment Anything with Multiple Modalities, 2024) extended SAM to depth, thermal, and LiDAR with +17.5% IoU improvement and only 0.05% additional parameters
- MoE-LoRA (Customize SAM with Mixture of..., 2024) introduced dynamic expert routing for robust multi-modal segmentation, gaining +28.14% mIoU on MUSES
- (Siamese Mamba Network, 2024) replaced quadratic-complexity transformers with linear-complexity Mamba for efficient multi-modal segmentation
- (WindowsAgentArena, 2024) and (GTA, 2024) established realistic agent benchmarks revealing large gaps vs. human performance
- (MMAU, 2024) exposed that the best audio model (59.08%) drastically trails human experts (81.85%) on reasoning-intensive audio tasks
- R1-Reward (R1-Reward, 2025) introduced StableReinforce for stable RL-based reward modeling, achieving +13.5% on VL Reward-Bench
- Seg-R1 (Seg-R1, 2025) demonstrated that RL alone can train LMMs to perform segmentation via prompt generation without pixel-level labels
- Phi-4-Multimodal (Phi-4, 2025) unified vision, speech, and text in a 3.8B model via Mixture-of-LoRA, ranking #1 on OpenASR
- Dr. Seg (Dr. Seg, 2026) revealed that perception tasks need breadth-first exploration rather than the depth-first convergence used in reasoning
- (MAPLE, 2026) introduced modality-aware policy optimization, reducing gradient variance by 12.89% and converging 3.18x faster
- Fine-R1 (Fine-R1, 2026) advanced fine-grained visual recognition by +23.75% through triplet-augmented RL with chain-of-thought reasoning
- (PIRA-Bench, 2026) shifted the agent paradigm from reactive execution to proactive intent recommendation
π Reinforcement learning became a primary post-training paradigm for multimodal models, with GRPO variants tailored for visual perception rather than just language reasoning.
π¬ Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Parameter-Efficient Visual Adaptation | Insert small, learnable modules (adapters, prompts, or spectral transforms) into frozen backbones to bridge domain gaps with minimal parameter overhead. | Mona surpasses full fine-tuning on COCO by +1.0% mAP and Pascal VOC by +3.6% AP using only 5% of trainable parameters. PointGST achieves 99.48% on ScanObjectNN with 0.67% trainable parameters, outperforming full fine-tuning by +1.6%. | LLaMA-Adapter (2023), 5%>>>100%: Breaking Performance Shackles of... (2024), Revisiting the Power of Prompt... (2024), Parameter-Efficient (2024), Adaptive Capacity Allocation for Vision... (2026) |
| RL-Enhanced Multimodal Training | Replace or augment supervised fine-tuning with policy optimization that uses verifiable rewards (mask IoU, correctness) to train multimodal models for perception and reasoning tasks. | R1-Reward achieves +13.5% on VL Reward-Bench over state-of-the-art via StableReinforce. Dr. Seg gains +2.0 gIoU on ReasonSeg and +2.4 AP on COCO detection over standard GRPO. MAPLE converges 3.18x faster than modality-blind training. | R1-Reward (2025), Seg-R1 (2025), Dr. Seg (2026), MAPLE (2026), Fine-R1 (2026) |
| Foundation Model Multi-Modal Adaptation | Freeze the foundation model's core weights and inject modality-specific adapters or prompt mechanisms to bridge the gap between RGB-trained representations and new sensor modalities. | MM-SAM improves over vanilla SAM by +17.5% IoU on RGB-Thermal and +28.3% IoU on LiDAR with only 0.05% additional parameters. MoE-LoRA gains +28.14% mIoU on MUSES 3-modality segmentation over state-of-the-art. ViPT beats full fine-tuning by +10.5% success rate on LasHeR with <1% trainable parameters. | Visual Prompt Multi-Modal Tracking (2023), Segment Anything with Multiple Modalities (2024), Customize Segment Anything Model for... (2024), Prompting Multi-Modal Image Segmentation with... (2024), PERSONALIZE (2023) |
| Compact Multimodal Architectures | Attach modality-specific lightweight adapters to a frozen compact language backbone and use dynamic processing strategies to handle diverse inputs without scaling model size. | Phi-4-Multimodal matches models twice its size on math and coding while ranking first on OpenASR with only 460M speech LoRA parameters. Ovis2.5-9B achieves 78.3 on OpenCompass, setting SOTA for open-source MLLMs under 40B. VLA-Adapter trains a full VLA in 8 hours on a single consumer GPU. | Phi-4-Mini (2025), Ovis2.5 Technical Report (2025), VLA-Adapter (2025), CHARM (2023) |
| Multimodal Agent Frameworks | Equip multimodal models with environment interaction capabilities (clicking, typing, API calls) and evaluate on realistic, implicit-planning tasks rather than simplified text-only benchmarks. | InfiGUI-G1 achieves up to 9.0% relative improvement over naive RLVR baselines on GUI grounding via Adaptive Exploration Policy Optimization. MobileGUI-RL generates curriculum tasks via random walks and GPT-4o reverse-engineering for scalable online training. | WindowsAgentArena (2024), GTA (2024), PIRA-Bench (2026), InfiGUI-G1 (2025) |
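The RL-Enhanced Multimodal Training row above relies on verifiable rewards such as mask IoU. As a minimal sketch of how such rewards can feed a GRPO-style update, the snippet below scores a group of sampled masks against a ground-truth mask and standardizes the rewards within the group. The toy masks, the function names, and the `eps` value are illustrative assumptions; this is not the training code of R1-Reward, Seg-R1, Dr. Seg, or MAPLE.

```python
import math

def mask_iou(pred, target):
    """IoU between two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical 4-pixel ground truth and four sampled predictions for one prompt.
target = [1, 1, 0, 0]
samples = [
    [1, 1, 0, 0],  # exact match
    [1, 0, 0, 0],  # misses one foreground pixel
    [1, 1, 1, 1],  # over-segments
    [0, 0, 1, 1],  # entirely wrong
]
rewards = [mask_iou(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, samples that are merely average for their prompt contribute almost no gradient signal, while better-than-average samples are reinforced and worse ones are penalized.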
π Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| OpenCompass Multimodal Leaderboard | Average Score | 78.3 (SOTA for open-source <40B) | Ovis2.5 Technical Report (2025) |
| ScanObjectNN (OBJ_BG) | Accuracy | 99.48% | Parameter-Efficient (2024) |
| VL Reward-Bench | Accuracy | +13.5% over SOTA (with inference-time scaling) | R1-Reward (2025) |
| MUSES (3-modality semantic segmentation) | mIoU | +28.14% mIoU over SOTA | Customize Segment Anything Model for... (2024) |
| MMAU (Massive Multi-Task Audio Understanding) | Accuracy | 59.08% (cascaded approach) vs. 81.85% human baseline | MMAU (2024) |
⚠️ Known Limitations (4)
- Parameter-efficient methods are validated primarily on classification and segmentation but struggle with open-ended generation tasks where the full model capacity is needed for creative and diverse outputs. (affects: Parameter-Efficient Visual Adaptation, Foundation Model Multi-Modal Adaptation)
Potential fix: Model Tailor's sparse patching with Hessian-based decoration selects the minimal parameter subset to update, reducing forgetting while maintaining target task performance.
- RL-based multimodal training suffers from reward sparsity and training instability, especially for long-horizon tasks like GUI automation where the outcome is only observable after many steps. (affects: RL-Enhanced Multimodal Training, Multimodal Agent Frameworks)
Potential fix: DeepVideo-R1 reformulates RL as a regression task with difficulty-aware augmentation; SAPO replaces hard clipping with smooth sigmoid gates and asymmetric temperatures to stabilize training.
- Multi-modal fusion methods assume all modalities are available at inference time, but real-world deployments frequently face missing, corrupted, or asynchronous sensor inputs that degrade performance. (affects: Foundation Model Multi-Modal Adaptation, Compact Multimodal Architectures)
Potential fix: DrFuse decomposes representations into shared and distinct components so the shared part can be inferred from any available modality. MoE-LoRA's dynamic routing gracefully degrades by assigning zero weight to missing modality experts.
- Safety alignment degrades significantly when reasoning capabilities are added to multimodal models, with MLRMs exhibiting 37% higher jailbreaking success rates than base models. (affects: RL-Enhanced Multimodal Training, Compact Multimodal Architectures)
Potential fix: Safe RLHF-V decouples helpfulness and safety into separate reward streams with Lagrangian-constrained optimization. SafeMLRM identifies 'emergent self-correction' where 16.23% of unsafe reasoning chains are overridden by safe final answers.
π View major papers in this topic (10)
- Phi-4-Mini and Phi-4-Multimodal (2025-04) 9
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery (2023-12) 9
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024-10) 9
- LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention (2023-03) 8
- 5%>>>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks (2024-08) 8
- R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning (2025-05) 8
- Segment Anything with Multiple Modalities (2024-08) 8
- MAPLE: Modality-Aware Post-training and Learning Ecosystem (2026-02) 8
- WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale (2024-09) 8
- Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning (2025-06) 8
💡 Diving deeper into Architecture and Efficiency, let's examine specific research threads that define this area.
Visual Encoders and Projections
What: Research on designing, compressing, and adapting Vision Transformer architectures and their projection layers for efficient multi-modal understanding and deployment on resource-constrained devices.
Why: Vision Transformers are powerful but computationally expensive, limiting deployment on edge devices and real-time multi-modal applications that require fast visual reasoning.
Baseline: Standard full-precision Vision Transformers with multi-head self-attention, extracting final-layer features projected through dense linear adapters into language model space.
- Non-normal activation distributions from Softmax and GELU make standard quantization techniques fail on Vision Transformers
- Multi-head attention introduces memory access bottlenecks that limit real-world inference speed despite low FLOP counts
- Dense parameter-shared projection layers create gradient conflicts when aligning heterogeneous modalities like vision, speech, and text
🧪 Running Example
Baseline: A standard ViT-B backbone requires ~17.6 GFLOPs and 86M parameters at full precision, causing 2+ second latency on mobile. The final-layer features miss fine-grained text details, and a dense projection adapter struggles to align visual features with the language model.
Challenge: The menu has tiny text requiring fine-grained spatial detail (lost in aggressive quantization), dim lighting creates outlier activations in LayerNorm (breaking standard quantizers), and the model must project visual features into language space efficiently (single dense adapter bottleneck).
π Overall Progress
The field has evolved from adapting CNN compression techniques to ViTs to developing ViT-native solutions that exploit architectural properties like power-law activations and channel-wise outliers. A major paradigm shift occurred around 2024 when researchers recognized that ViT activations require fundamentally different quantization approaches than CNNs. More recently, the focus has broadened beyond pure efficiency to include structured pretraining with LLM supervision, sparse expert-based multi-modal projection, and mechanistic understanding of what visual encoders learn internally, reflecting a maturation from compressing encoders to designing them comprehensively.
π Sub-topics
Post-Training Quantization for Vision Transformers
16 papers
Methods for compressing pre-trained ViTs to low bit-widths (3-8 bit) without retraining, addressing unique challenges from non-normal activation distributions in Softmax, GELU, and LayerNorm operations. The largest research cluster in this topic.
Efficient Vision Transformer Architectures
3 papers
Macro and micro architectural innovations that reduce memory access costs and computational redundancy in ViTs for real-time deployment on edge devices, including single-head designs and sandwich layouts.
Multi-modal Feature Projection and Alignment
8 papers
Adapter and projection architectures that bridge visual encoders with language models, including sparse expert routing, progressive context extension, layer-wise feature fusion, and modality-specific alignment strategies.
Visual Encoder Pretraining and Fine-tuning
12 papers
Novel strategies for pretraining visual encoders with structured LLM supervision, reinforcement learning rewards, multi-modal contrastive objectives, and efficient adaptation techniques that improve robustness and transferability.
Visual Encoder Interpretability and Robustness
5 papers
Research on understanding internal mechanisms of vision transformers, locating demographic biases at the attention-head level, analyzing action-outcome circuits, and improving robustness to distribution shifts through mechanistic analysis.
💡 Key Insights
💡 ViT activations require fundamentally different quantization than CNNs due to non-normal distributions
💡 Memory access cost, not FLOPs, is the true bottleneck for on-device ViT inference speed
💡 Middle ViT layers often outperform final layers by 20% for spatial and fine-grained tasks
💡 Frozen LLMs can supervise visual encoder pretraining more effectively than free-form text with far less data
💡 Sparse expert routing resolves gradient conflicts that plague dense multi-modal projection adapters
π Timeline
Research progressed from basic ViT quantization and efficient architectures (2023) through specialized distribution-aware methods and hardware co-design (2024) to structured LLM-supervised pretraining, multi-modal projection, and mechanistic interpretability (2025-2026), reflecting a shift from pure compression to holistic visual encoder design.
- (TSPTQ-ViT, 2023) introduced two-scaled quantization separating activation magnitudes for Softmax and GELU, achieving <0.5% accuracy drop at 8-bit
- (EfficientViT, 2023) established memory-efficient ViT design with the sandwich layout and cascaded group attention, running 5.8x faster than MobileViT
- I&S-ViT (I&S-ViT, 2023) achieved a breakthrough +50.68% accuracy recovery at 3-bit through the Shift-Uniform-Log2 Quantizer (SULQ) and smooth optimization strategy
- (SHViT, 2024) demonstrated that single-head attention on partial channels matches multi-head performance while being 2.4x faster on iPhone 12
- P2-ViT (P2-ViT, 2024) pioneered power-of-two quantization with a dedicated hardware accelerator, achieving 10.1x speedup over GPU Tensor Cores
- (ADFQ-ViT, 2024) and (DopQ-ViT, 2024) independently tackled outlier-aware and distribution-friendly quantization for ViT activations
- PTQ4SAM (PTQ4SAM, 2024) extended ViT quantization to foundation models, addressing bimodal distributions unique to the Segment Anything Model with 3.9x FLOPs reduction
- (COMQ, 2024) eliminated backpropagation from PTQ entirely via coordinate descent, achieving <1% accuracy loss at 4-bit
π Research shifted from adapting CNN quantization techniques to designing ViT-native quantizers that explicitly model power-law and bimodal activation distributions.
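To make the distribution argument concrete, here is a small sketch comparing a uniform quantizer with a log2 quantizer on a hypothetical post-Softmax attention row, where one value dominates and the rest are tiny. The probabilities, the 3-bit setting, and both function names are illustrative assumptions; this is a generic textbook log2 quantizer, not the SULQ of I&S-ViT or any other specific method listed above.

```python
import math

def uniform_quantize(x, bits):
    """Uniform quantizer on [0, 1] with 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels

def log2_quantize(x, bits):
    """Log2 quantizer: snap x to 2**-e (nearest in the log domain), e in [0, 2**bits - 1]."""
    if x <= 0.0:
        return 0.0
    max_exp = 2 ** bits - 1
    exp = min(max_exp, max(0, round(-math.log2(x))))
    return 2.0 ** (-exp)

# Hypothetical attention row: one dominant token and a long tail of small values.
probs = [0.9, 0.05, 0.02, 0.01, 0.01, 0.005, 0.003, 0.002]
bits = 3

# Count how many entries each quantizer collapses to exactly zero.
uniform_zeroed = sum(1 for p in probs if uniform_quantize(p, bits) == 0.0)
log2_zeroed = sum(1 for p in probs if log2_quantize(p, bits) == 0.0)
```

On this toy row the uniform grid erases every tail entry, while the log2 grid keeps all of them non-zero. The trade-off is that the log2 grid is coarse near 1 (0.9 snaps to 1.0), which is one reason distribution-aware methods go beyond a plain log2 quantizer.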
- VITA-1.5 (VITA-1.5, 2025) demonstrated three-stage progressive training to integrate vision, audio, and speech without modality interference, approaching GPT-4o capabilities
- (Long-VITA, 2025) scaled visual-language models to 1 million tokens through phased context-length training with logits-masked inference
- (AIQViT, 2025) introduced learnable low-rank adapters alongside quantized weights for architecture-informed compensation across 5 vision tasks
- (APHQ-ViT, 2025) replaced GELU with ReLU via knowledge distillation and introduced perturbation-based Hessian estimation, outperforming prior methods by up to 30% at 3-bit
- (Shallower Layers, 2025) proved middle ViT layers outperform the conventionally-used final layers by 20% on spatial tasks, challenging a widespread design assumption
- (VIVID-Med, 2026) used a frozen LLM as a structured semantic teacher to pretrain deployable medical ViTs, outperforming BiomedCLIP with 500x less data
- (MoE-Adapter, 2026) resolved gradient conflicts in multi-modal projection through sparse expert routing with load-balancing
- (Bias in CLIP, 2026) located demographic bias at individual attention heads in CLIP, enabling targeted debiasing by ablating just 4 heads
π Focus expanded from pure compression and efficiency to understanding and steering what visual encoders learn, using LLMs as semantic teachers and mechanistic interpretability to locate and remove biases.
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Distribution-Aware Post-Training ViT Quantization | Tailored quantization schemes that split activations by magnitude, use adaptive logarithmic bases, or separate outlier channels to preserve precision where standard uniform quantizers fail. | I&S-ViT improves on RepQ-ViT by +50.68% Top-1 accuracy on 3-bit ViT-B; ADFQ-ViT improves on RepQ-ViT by +10.23% Top-1 on 4-bit ViT-B; APHQ-ViT outperforms PTQ4ViT by +3.65% Top-1 on ViT-B at 4-bit, achieving 78.43% | APHQ-ViT (2025), I&S-ViT: An Inclusive & Stable... (2023), ADFQ-ViT (2024), DopQ-ViT (2024), AdaLog (2024) |
| Hardware-Accelerated ViT Inference | Power-of-Two (PoT) scaling factors enable pure bit-shift re-quantization, eliminating costly floating-point operations and enabling dedicated ViT accelerator chips with pipelined dataflows. | P2-ViT achieves 10.1x speedup and 36.8x energy savings over GPU Turing Tensor Cores while maintaining 81.39% Top-1 on ImageNet for ViT-B; Trio-ViT delivers 7.3x FPS improvement over ViTCoD accelerator | P2-ViT (2024), Trio-ViT (2024), AIQViT (2025) |
| Memory-Efficient ViT Architecture | Single-head attention on partial channels combined with large-stride patchification and sandwich FFN layouts eliminate redundant memory operations without sacrificing accuracy. | SHViT-S4 outperforms MobileViTv2-1.0 by +1.3% accuracy while being 2.4x faster on iPhone 12; EfficientViT-M5 surpasses MobileNetV3-Large by +1.9% while running 40.4% faster on V100 GPU | SHViT (2024), EfficientViT (2023) |
| Sparse Expert Multi-modal Projection | A learnable router directs different modality segments to dedicated experts, isolating conflicting optimization gradients while progressive training prevents cross-modal interference. | MoE-Adapter improves on dense baselines by +3.75% accuracy on OpenBookQA (50.10% to 53.85%) and reduces the audio-text modality gap from -17.83 to -14.67 on MMSU | MoE-Adapter (2026), VITA-1.5 (2025), Long-VITA (2025), Multimodal Language Models See Better... (2025) |
| Structured Visual Encoder Pretraining | Using LLM-generated structured schemas or RL-based reward functions as supervision produces visual encoders with stronger generalization and fewer spurious correlations than conventional one-hot or free-text training. | VIVID-Med outperforms BiomedCLIP by +6.65 points macro-AUC on CheXpert linear probing (achieving 0.8588) despite using 500x less pretraining data; GRPO-RM achieves +4.26% average accuracy on out-of-distribution datasets over standard fine-tuning | VIVID-Med (2026), GRPO-RM (2025), Pretrained Visual Uncertainties (2024), Concept-Guided Fine-Tuning (2026) |
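The Hardware-Accelerated ViT Inference row says that power-of-two scaling factors allow pure bit-shift re-quantization. The sketch below illustrates the general idea under simple assumptions: a signed INT8 output range, round-half-up rounding, and a scale of exactly `2**-shift`. The function names are hypothetical and this is not the P2-ViT accelerator dataflow.

```python
import math

def clamp_int8(q):
    """Saturate to the signed 8-bit range."""
    return max(-128, min(127, q))

def requantize_float(acc, scale):
    """Reference path: multiply the integer accumulator by a floating-point scale."""
    return clamp_int8(math.floor(acc * scale + 0.5))

def requantize_shift(acc, shift):
    """Power-of-two path: scale == 2**-shift, so rescaling is an add and a right shift."""
    if shift < 1:
        raise ValueError("shift must be at least 1")
    # Adding half of the divisor before shifting gives round-half-up behaviour.
    return clamp_int8((acc + (1 << (shift - 1))) >> shift)

# The two paths agree on every accumulator value in this range for scale = 2**-5.
agree = all(
    requantize_float(a, 2.0 ** -5) == requantize_shift(a, 5)
    for a in range(-5000, 5001)
)
```

The integer path needs no multiplier or floating-point unit, which is the property the table attributes to PoT scaling.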
π Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| ImageNet-1K Classification (4-bit W4/A4 ViT-B) | Top-1 Accuracy | 81.3% Top-1 Accuracy | AIQViT (2025) |
| ImageNet-1K Classification (3-bit ViT-B) | Top-1 Accuracy Recovery | +50.68% accuracy recovery over prior state-of-the-art | I&S-ViT: An Inclusive & Stable... (2023) |
| ImageNet-1K Efficient ViT Inference Speed | Top-1 Accuracy at matched latency | SHViT-S4: 2.4x faster than MobileViTv2-1.0 at +1.3% higher accuracy on iPhone 12 | SHViT (2024) |
| CheXpert Linear Probing | Macro-AUC | 0.8588 macro-AUC | VIVID-Med (2026) |
| ViT Hardware Energy Efficiency | Speedup and energy savings vs GPU Tensor Cores | 10.1x speedup, 36.8x energy savings over GPU Turing Tensor Cores | P2-ViT (2024) |
⚠️ Known Limitations (4)
- Most ViT quantization methods are validated primarily on ImageNet classification; generalization to diverse downstream tasks such as detection, segmentation, and generation remains underexplored and may not transfer directly. (affects: Distribution-Aware Post-Training ViT Quantization, Hardware-Accelerated ViT Inference)
Potential fix: PTQ4SAM and ERQ have begun extending quantization to SAM and dense prediction models; more systematic cross-task evaluation frameworks are needed.
- Ultra-low-bit quantization (3-bit) still incurs significant accuracy drops on complex hierarchical architectures like Swin Transformers, making practical deployment below 4-bit challenging for production systems. (affects: Distribution-Aware Post-Training ViT Quantization)
Potential fix: Combining distribution-aware quantizers with low-rank compensation (AIQViT) or GELU-to-ReLU substitution via knowledge distillation (APHQ-ViT) shows promise for pushing below 4-bit.
- Efficient ViT architectures sacrifice fine-grained spatial information for speed: single-head designs and large-stride stems aggressively reduce tokens, which may hurt dense prediction tasks requiring per-pixel precision. (affects: Memory-Efficient ViT Architecture)
Potential fix: Hybrid approaches combining partial-channel attention with local depthwise convolutions (as in SHViT) partially address this, but dedicated efficient ViTs for dense prediction remain an open area.
- Multi-modal projection methods are evaluated on different benchmarks with different LLM backbones and training data scales, making fair comparison across projection architectures extremely difficult. (affects: Sparse Expert Multi-modal Projection, Structured Visual Encoder Pretraining)
Potential fix: Standardized multi-modal benchmarks like Creation-MMBench are emerging to enable fairer comparisons; the community needs agreed-upon evaluation protocols for projection layers.
π View major papers in this topic (10)
- I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization (2023-11) 8
- APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers (2025-04) 8
- P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer (2024-05) 8
- ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers (2024-07) 8
- COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization (2024-03) 8
- PTQ4SAM: Post-Training Quantization for Segment Anything (2024-05) 8
- SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design (2024-01) 7
- VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs (2026-03) 8
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (2025-01) 8
- Multimodal Language Models See Better When They Look Shallower (2025-11) 7
💡 Within the same paradigm, another important research direction focuses on Token Compression and Efficient Inference.
Token Compression and Efficient Inference
What: Research on compressing model representations (through quantization, token pruning, and distillation) to enable efficient inference for vision and multimodal models on resource-constrained devices.
Why: Deploying large vision transformers and multimodal LLMs in real-time applications demands dramatic reductions in memory, latency, and compute without sacrificing accuracy.
Baseline: Full-precision (FP32) models with all visual tokens processed through every layer, requiring maximum memory and compute resources.
- Vision transformer activations exhibit extreme outliers and non-normal distributions that break standard quantizers
- Visual token redundancy in multimodal LLMs wastes compute on background regions irrelevant to the task
- Aggressive compression at ultra-low bit-widths causes catastrophic accuracy collapse in safety-critical applications
🧪 Running Example
Baseline: The full-precision VLA model processes all 576 visual tokens through all 32 LLM layers at FP32 precision, consuming 14GB of memory (far exceeding the 4GB budget) and running at 2 FPS, too slow for real-time grasping.
Challenge: The scene has mostly irrelevant background tokens (table surface, walls), the model's post-Softmax activations follow a power-law distribution that breaks naive INT8 quantization, and the small cup requires preserving fine-grained detail tokens despite compression.
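A back-of-the-envelope check of the running example's budget can be sketched as follows. It assumes, as a simplification, that the 14GB footprint scales linearly with bit-width (real deployments often keep some tensors at higher precision), and it borrows the 76% pruning ratio from the METEOR result reported elsewhere in this section. All constant and function names are hypothetical.

```python
FP32_MEMORY_GB = 14.0   # footprint stated in the running example
BUDGET_GB = 4.0         # device budget stated in the running example
VISUAL_TOKENS = 576     # visual tokens stated in the running example
BITS_FP32 = 32

def scaled_memory_gb(bits):
    """Estimated footprint if every FP32 value were stored in `bits` bits (assumption)."""
    return FP32_MEMORY_GB * bits / BITS_FP32

def tokens_kept(prune_ratio):
    """Number of visual tokens remaining after pruning a given fraction."""
    return round(VISUAL_TOKENS * (1.0 - prune_ratio))

fp16_gb = scaled_memory_gb(16)   # 7.0 GB, still over the 4 GB budget
int8_gb = scaled_memory_gb(8)    # 3.5 GB, fits under this linear-scaling assumption
kept = tokens_kept(0.76)         # 138 of 576 tokens survive 76% pruning
```

Under these assumptions, halving precision to 16 bits is not enough for the 4GB budget, whereas 8-bit storage fits, which is why the example singles out INT8 quantization as the point where the Softmax distribution problem must be solved.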
π Overall Progress
The field has progressed from recognizing that standard quantization fails on vision transformers (2023), through developing specialized solutions for diverse vision domains like SAM, LiDAR, and diffusion models (2024), to integrating quantization with token compression and dynamic routing in production multimodal systems (2025-2026). A key paradigm shift is the move from treating efficiency as purely a post-hoc compression problem to designing architectures with built-in elastic inference capabilities, as seen in Matryoshka representations and visual resolution routing.
π Sub-topics
Post-Training Quantization for Vision Models
30 papers
Methods that quantize pretrained vision transformers and related architectures to low bit-widths (3-8 bit) without retraining, addressing challenges like outlier activations, non-normal distributions, and hardware compatibility.
Visual Token Reduction for Multimodal LLMs
8 papers
Techniques that reduce the number or dimensionality of visual tokens fed into large multimodal models, including token pruning, merging, early-layer bypass, and dynamic resolution routing.
Knowledge Distillation for Efficient Multimodal Inference
14 papers
Methods that transfer knowledge from large teacher models to compact student models, enabling deployment of multimodal capabilities on edge devices through cross-modal, competitive, or chain-of-thought distillation.
Discrete Tokenization and Efficient Inference Infrastructure
10 papers
Surveys and frameworks covering discrete tokenizer design, training infrastructure for efficient fine-tuning, and semantic compression for bandwidth-constrained multimodal systems.
💡 Key Insights
💡 Matching quantizer shape to activation distribution is critical for ViT accuracy at low bit-widths.
💡 Pruning 76% of visual tokens preserves accuracy while halving compute in multimodal LLMs.
💡 Domain-specific PTQ calibration matches full-precision performance for safety-critical deployments.
💡 Dynamic resolution routing enables 4× inference speedup with negligible accuracy loss.
💡 Small distilled models achieve 94% of large model quality at 80× less computational cost.
π Timeline
Research has evolved from layer-by-layer quantization fixes toward holistic efficiency frameworks that combine multiple compression strategies (quantization, token pruning, distillation, and dynamic resolution) into unified systems capable of adapting compute budgets at inference time.
- (TSPTQ-ViT, 2023) introduced two-scaled quantization splitting activations by magnitude for Softmax and GELU outputs
- MRECG (Solving Oscillation Problem in PTQ, 2023) theoretically proved oscillation in PTQ and proposed mixed reconstruction granularity, gaining +6.61% on MobileNetV2
- I&S-ViT (I&S-ViT, 2023) achieved stable 3-bit ViT quantization with shift-uniform-log2 quantizer, elevating accuracy by 50.68% over RepQ-ViT
- FP8 PTQ study (Efficient Post-training Quantization with FP8, 2023) demonstrated FP8 covers 92.64% of workloads versus only 65.87% for INT8 across 75 diverse models
π Recognition that standard CNN quantization methods fail on Vision Transformers due to fundamentally different activation distributions from Softmax, GELU, and LayerNorm.
- (LiDAR-PTQ, 2024) pioneered point-cloud-aware quantization with sparsity calibration, achieving near-lossless INT8 on Waymo with 3× speedup
- (RepQuant, 2024) introduced quantization-inference decoupling via scale reparameterization, gaining +30.7% on ViT-S at W4/A4
- PTQ4SAM (PTQ4SAM, 2024) solved SAM's bimodal distribution problem with bimodal integration and adaptive granularity quantization
- VQ4DiT (VQ4DiT, 2024) enabled 2-bit diffusion transformers through simultaneous codebook and assignment calibration, achieving 3.32 FID
- (Visual-Modality, 2024) showed that compressing redundant visual tokens improves MLLM instruction-following by +9.5%
π Shift from generic ViT PTQ to domain-specific quantization (SAM, LiDAR, diffusion, super-resolution) and the first visual token compression methods for multimodal LLMs.
- InternVL3.5 (InternVL3.5, 2025) introduced Visual Resolution Router achieving 4.05× speedup while scoring 77.7 on MMMU, narrowing the gap with GPT-5
- (METEOR, 2025) demonstrated progressive multi-stage pruning reducing tokens by 76% with only 0.3% accuracy drop across 11 benchmarks
- (APHQ-ViT, 2025) replaced Fisher approximation with direct perturbation Hessian for robust ultra-low-bit quantization
- (FMVR, 2026) achieved 89% FLOPs reduction via frequency-modulated Matryoshka visual restoration, outperforming FastV by up to 7%
- (Render-of-Thought, 2026) pioneered compressing chain-of-thought reasoning into visual tokens for 3-4× token compression with 4.6× speedup
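The token-pruning results in this timeline rest on a simple primitive: rank visual tokens by an importance score and keep only the top fraction. The sketch below shows that primitive in isolation. The token labels and scores are invented for illustration (in practice a score might come from text-to-image attention), and the function is a generic top-k selection, not METEOR's multi-stage procedure or FastV.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores, then restored to positional order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)
    return [tokens[i] for i in keep], keep

# Hypothetical patch labels and importance scores for a tabletop grasping scene.
tokens = ["wall", "table", "cup", "table", "wall", "hand", "wall", "table"]
scores = [0.02, 0.05, 0.40, 0.04, 0.01, 0.30, 0.03, 0.15]

kept_tokens, kept_indices = prune_tokens(tokens, scores, keep_ratio=0.25)
```

Returning the kept indices alongside the tokens lets a caller preserve positional information for the surviving tokens, which matters when the downstream model uses position embeddings.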
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Distribution-Aware Vision Transformer Quantization | Match quantizer shape to activation distribution (power-law for Softmax, outlier-prone for LayerNorm) rather than forcing a uniform grid. | RepQuant improves on PTQ4ViT by +30.7% accuracy on ImageNet ViT-S at W4/A4, achieving 73.28% Top-1; APHQ-ViT outperforms PTQ4ViT by +3.65% on ViT-B at 4-bit, reaching 78.43%. | RepQuant (2024), APHQ-ViT (2025), I&S-ViT: An Inclusive & Stable... (2023), ADFQ-ViT (2024), DopQ-ViT (2024) |
| Task-Specific Post-Training Quantization | Design calibration and quantization strategies tailored to the unique activation patterns of each vision domain rather than applying generic ViT PTQ. | LiDAR-PTQ achieves 60.12 mAPH on Waymo CenterPoint-Pillar (INT8), nearly matching the FP32 baseline (60.32) and outperforming BRECQ by +3.87 mAPH; PTQ4SAM achieves lossless 6-bit SAM-L with 3.9× FLOPs reduction. | LiDAR-PTQ (2024), PTQ4SAM (2024), VQ4DiT (2024), 2DQuant: Low-bit Post-Training Quantization for... (2024), Post-Training (2025) |
| Progressive Visual Token Pruning | Identify and remove redundant visual tokens at encoding, fusion, and decoding stages using attention scores, information rank, or frequency decomposition. | METEOR reduces visual tokens by 76% over EAGLE baseline with only 0.3% accuracy drop and outperforms FastV by +4.1% average across 11 benchmarks; FMVR reduces FLOPs by 89% while maintaining ~100% of LLaVA-1.5-7B accuracy. | METEOR (2025), Frequency-Modulated (2026), DeepInsert (2025), Visual-Modality (2024) |
| Dynamic Visual Resolution Routing | Let the model learn to allocate visual compute adaptively based on content complexity rather than using a fixed resolution for all inputs. | InternVL3.5 achieves 4.05× speedup over InternVL3 with Visual Resolution Router while scoring 77.7 on MMMU, narrowing the gap with GPT-5 to 3.9%; Long-VITA extends context to 1M tokens with 2× prefill speedup. | InternVL3.5 (2025), Long-VITA (2025), Render-of-Thought (2026) |
| Cross-Modal Knowledge Distillation | Use competitive, chain-of-thought, or manifold-alignment distillation to transfer multimodal understanding from large teachers to lightweight students without requiring multimodal inputs at inference. | UDRL-SLM achieves relevance within 6% of Llama-3 8B with 80× fewer parameters (100M vs 8B) at 338 tokens/second; CoMD's 7B student surpasses its 13B teacher by +1.47% on ScienceQA, reaching 91.83%. | Unlock the Power (2023), From Images to Words: Efficient... (2026), Scaling Multimodal Search and Recommendation... (2025), CoT-Drive (2025) |
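The distillation variants in the last row build on the classic soft-label objective, in which a student is trained to match a teacher's temperature-softened output distribution. The sketch below implements only that baseline objective (KL divergence scaled by the squared temperature); the logits are invented, the function names are hypothetical, and the competitive, chain-of-thought, and manifold-alignment schemes named in the table are not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits over three classes.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A higher temperature flattens both distributions so that the teacher's relative preferences among non-top classes contribute more to the loss; the squared-temperature factor keeps gradient magnitudes comparable across temperatures.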
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| ImageNet Classification (ViT PTQ) | Top-1 Accuracy | 78.43% | APHQ-ViT (2025) |
| Waymo Open Dataset (LiDAR PTQ) | mAPH Level 2 | 60.12 mAPH | LiDAR-PTQ (2024) |
| MMMU (Multimodal Understanding) | Accuracy | 77.7% | InternVL3.5 (2025) |
| EAGLE Multi-Encoder Benchmarks (Token Pruning) | Average Score across 11 benchmarks | 76% token reduction with only 0.3% average accuracy drop | METEOR (2025) |
Known Limitations (4)
- Ultra-low-bit quantization (3-bit or below) still causes significant accuracy drops on complex downstream tasks like detection and segmentation, even with specialized quantizers. (affects: Distribution-Aware Vision Transformer Quantization, Task-Specific Post-Training Quantization)
  Potential fix: Combining quantization with low-rank compensation (AIQViT) or MLP reconstruction with activation substitution (APHQ-ViT) can partially mitigate ultra-low-bit degradation.
- Token pruning methods rely on attention-based importance scores that may discard visually subtle but semantically critical tokens, particularly for fine-grained tasks like OCR or small object detection. (affects: Progressive Visual Token Pruning, Dynamic Visual Resolution Routing)
  Potential fix: Task-adaptive retention strategies (like METEOR's Visual Attention Value for OCR) and frequency-based restoration (FMVR) can recover fine-grained details lost during aggressive pruning.
- Knowledge distillation requires access to a powerful teacher model and significant compute for generating training data, creating a dependency bottleneck for resource-constrained teams. (affects: Cross-Modal Knowledge Distillation)
  Potential fix: Black-box distillation methods like ARMADA that only need teacher outputs rather than weights, and synthetic data generation approaches, reduce the barrier to effective distillation.
- Most PTQ methods are evaluated primarily on ImageNet classification; generalization to diverse downstream tasks (video, 3D, medical imaging) remains underexplored and inconsistent. (affects: Distribution-Aware Vision Transformer Quantization, Task-Specific Post-Training Quantization)
  Potential fix: Task-guided supervision losses (LiDAR-PTQ) and temporal-aware calibration (PTQ4VM) demonstrate that incorporating task-specific priors during calibration improves cross-domain generalization.
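
To make the ultra-low-bit limitation concrete, here is a minimal sketch of a generic symmetric uniform quantizer, showing how round-trip error grows as the bit-width shrinks. This is a textbook scheme rather than the calibration procedure of any method cited above; the function names and the toy activation values are assumptions for illustration.

```python
from typing import List


def quantize_dequantize(values: List[float], bits: int) -> List[float]:
    """Round-trip values through a signed symmetric uniform quantizer."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 3 for 3 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for v in values:
        q = round(v / scale)          # nearest integer level
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original: List[float], restored: List[float]) -> float:
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up activations; real calibration would use statistics from actual data.
activations = [0.02, -0.15, 0.40, -0.73, 1.00]
for bits in (8, 4, 3):
    err = mean_abs_error(activations, quantize_dequantize(activations, bits))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

With only seven representable levels at 3 bits, small activations collapse to zero, which is one intuition for why fine-grained tasks degrade first.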
Major papers in this topic (8)
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (2025-08) 9
- LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection (2024-01) 9
- RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization (2024-02) 8
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models (2025-08) 8
- APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers (2025-04) 8
- Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models (2026-03) 8
- PTQ4SAM: Post-Training Quantization for Segment Anything (2024-05) 8
- From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval (2025-02) 8
Within the same paradigm, another important research direction focuses on Multimodal Pretraining and Instruction Tuning.
Multimodal Pretraining and Instruction Tuning
What: Research on pretraining models to jointly understand multiple modalities (vision, language, 3D, audio) and fine-tuning them to follow complex multimodal instructions.
Why: Enabling AI systems to perceive and reason across diverse data types is essential for real-world applications from medical diagnosis to embodied interaction.
Baseline: The standard approach uses frozen pretrained encoders (e.g., CLIP) with simple linear probing or basic visual question-answering fine-tuning on paired data.
- Aligning heterogeneous modality representations into a coherent shared embedding space without losing modality-specific information
- Preventing capability degradation of the language backbone when fine-tuning on visual instruction data
- Scaling to new domains and modalities with limited paired training data
Running Example
Baseline: A generic CLIP-based model can match the image to broad disease categories but cannot produce detailed clinical descriptions, misses subtle findings, and may hallucinate non-existent pathology due to weak vision-language alignment in medical domains.
Challenge: The retinal image requires domain-specific visual understanding (fine-grained lesion detection), language generation aligned with clinical terminology, and the model must avoid confident hallucinations that could mislead clinical decisions.
Overall Progress
The field has evolved from extending CLIP to new modalities (3D, audio, medical) toward sophisticated alignment techniques that preserve language capabilities while improving visual understanding. A key paradigm shift occurred with the discovery of modality degradation and the adoption of preference optimization (DPO) as a lightweight fix. Theoretical foundations now explain why multi-modal learning inherently outperforms single-modal approaches through noise suppression, and practical deployment has been enabled through domain-specific pretraining that achieves competitive performance with orders of magnitude less data.
Sub-topics
Contrastive Vision-Language Pretraining
12 papers
Methods extending CLIP-style contrastive learning to align multiple modalities including vision, language, audio, and 3D data, with innovations in temperature scheduling, multi-view alignment, and theoretical foundations explaining why multi-modal learning outperforms single-modal approaches.
Instruction Tuning and Preference Alignment
10 papers
Techniques for aligning multimodal LLMs with human preferences through instruction data curation, direct preference optimization, and competitive distillation to improve open-ended conversation quality while preventing language capability degradation.
3D-Language-Image Pretraining
6 papers
Approaches bridging the gap between 3D point cloud understanding and 2D vision-language models through proxy-based alignment, multi-view distillation, and adapter-based transfer learning for embodied interaction and shape retrieval.
Domain-Specific Multimodal Pretraining
9 papers
Specialized pretraining frameworks for medical imaging, biosignals, urban computing, and earth observation that adapt general-purpose multi-modal learning to data-scarce, high-stakes domains using knowledge-enhanced and expert-guided strategies.
Multimodal Architecture and Mixing Strategies
8 papers
Architectural innovations for combining multiple encoders, mixing model weights from different training domains, and enabling tool-augmented multimodal agents with continual learning capabilities.
Key Insights
- Multi-modal contrastive learning theoretically eliminates noise memorization that limits single-modal approaches
- Preference optimization with just 5K samples reverses language degradation from visual instruction tuning
- Domain-specialized pretraining with elite data matches models trained on 100x more private data
- Model weight merging outperforms LoRA and full fine-tuning for continual multimodal updates
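
The weight-merging insight can be pictured with the simplest possible merge, a linear interpolation between two checkpoints. This is only a sketch under assumptions: the source does not say which merging rule the cited work uses, and the `merge_weights` helper, the `alpha` parameter, and the tiny parameter dictionaries are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_weights(base: Params, finetuned: Params, alpha: float) -> Params:
    """Linearly interpolate parameters: (1 - alpha) * base + alpha * finetuned."""
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Params = {}
    for name in base:
        if len(base[name]) != len(finetuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * f
            for b, f in zip(base[name], finetuned[name])
        ]
    return merged


# Toy checkpoints with a single two-element parameter.
base_ckpt = {"w": [0.0, 2.0]}
tuned_ckpt = {"w": [1.0, 4.0]}
print(merge_weights(base_ckpt, tuned_ckpt, alpha=0.5))  # {'w': [0.5, 3.0]}
```

Setting `alpha` closer to 0 keeps more of the base model's behavior, while values closer to 1 favor the newly trained checkpoint.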
Timeline
Research has progressed from general-purpose contrastive pretraining toward domain-specialized, alignment-aware multimodal models with increasing emphasis on data efficiency, deployment readiness, and theoretical understanding of cross-modal learning dynamics.
- CLIP2 (2023) pioneered real-world 3D-language alignment using automatically mined proxy triplets, achieving +253% improvement over PointCLIP on outdoor recognition
- SPHINX (2023) introduced three-fold mixing of weights, tasks, and visual embeddings to create versatile MLLMs with a 90.8 POPE score
- LLaVA-Plus (2023) established the paradigm of end-to-end visual tool learning with image-grounded planning, reaching 1203 Elo on VisIT-Bench
- CoMD (Competitive Distillation for Multi-Modal LLMs, 2023) demonstrated that a 7B student model can surpass its 13B teacher through iterative competitive distillation
- Multi-modal Preference Alignment (2024) first demonstrated that DPO with just 5K distilled samples reverses language degradation from visual instruction tuning
- ShapeLLM (2024) created the first 3D multimodal LLM for embodied interaction with selective multi-view distillation
- EyeCLIP (2024) achieved state-of-the-art zero-shot classification across 9 ophthalmic datasets by aligning multiple imaging modalities with clinical text
- SleepFM (2024) introduced leave-one-out contrastive learning for physiological signals, outperforming supervised CNNs on sleep analysis (AUROC 0.88 vs 0.72)
- FoMo-in-Flux (Practitioner's Guide to Continual Multimodal Pretraining, 2024) established model merging as the superior strategy for continual multimodal updates under realistic compute budgets
- Signal-Noise Theory (On the Comparison between Multi-modal..., 2024) provided theoretical proof that multi-modal learning fundamentally suppresses noise memorization
Shift from purely contrastive pretraining to preference-based alignment (DPO) for multimodal models, addressing the newly discovered 'modality degradation' problem where visual fine-tuning harms language capabilities
- OmniAlign-V (2025) created a 200K high-quality alignment dataset with semantic richness filtering, enabling a 32B model to outperform a proprietary 72B model
- VIVID-Med (2026) used LLM-supervised structured pretraining to create deployable medical ViTs that outperform BiomedCLIP by +6.65 macro-AUC with 500x less data
- MM-TS (2026) introduced density-aware temperature and margin scheduling for long-tail robustness in contrastive learning
- CVS (Does the Question Really Matter?, 2026) proposed training-free data selection that outperforms full-data training by 4.8% using only 15% of samples
- Visual Self-Fulfilling Alignment (2026) leveraged the self-fulfilling mechanism to align multimodal safety without explicit safety labels
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Contrastive Multi-Modal Alignment | Pull matching cross-modal pairs together while pushing non-matching pairs apart, using dynamic temperature and density-aware margin schedules to handle concept frequency imbalance. | Improves on standard fixed-temperature CLIP by +69.45% accuracy on ColoredMNIST (82.13% vs 12.68%) through multi-modal signal cooperation that suppresses spurious correlations | On the Comparison between Multi-modal... (2024), MM-TS (2026), Improving Medical Multi-modal Contrastive Learning... (2024), Turbo your multi-modal classification with... (2024) |
| Multimodal Preference Optimization | Use strong teacher models to generate preference pairs and apply DPO (Direct Preference Optimization) to align weaker models, reversing modality degradation with lightweight data. | Improves on LLaVA-1.5 baseline by +13.6 WildVision Score and +4.9% on MM-Vet, while surpassing the base language model Vicuna on text-only MT-Bench (6.73 vs 6.57) | OmniAlign-V (2025), Multi-modal Preference Alignment Remedies Degradation... (2024), Unlock the Power (2023), MM-Instruct (2024) |
| 3D-Language-Image Pretraining | Align 3D point cloud encoders with frozen 2D vision-language models using automatically mined text-image-point triplets, enabling zero-shot 3D recognition without human labels. | Improves on PointCLIP by +253% relative accuracy on nuScenes zero-shot recognition (37.8% vs 11.7%); ShapeLLM-13B outperforms PointLLM by +5.1% on 3D MM-Vet benchmark | CLIP2 (2023), ShapeLLM (2024), TAMM (2024), MM-Point (2024) |
| Domain-Specialized Vision-Language Pretraining | Use small high-quality elite datasets or structured clinical knowledge as 'sparks' to guide pretraining on larger unlabeled collections, then optionally discard the teacher for lightweight deployment. | VIVID-Med outperforms BiomedCLIP by +6.65 macro-AUC points on CheXpert using 500x less data; EyeCLIP achieves 0.757 AUROC vs 0.654 for BioMedCLIP on diabetic retinopathy zero-shot classification | VIVID-Med (2026), EyeCLIP (2024), SleepFM (2024), MM-Retinal V2 (2025) |
| Multi-Modal Architecture Mixing | Mix model weights, visual embeddings from diverse encoders, and training tasks to create versatile multimodal LLMs that combine real-world and synthetic knowledge without domain conflict. | SPHINX achieves 90.8 POPE score surpassing LLaVA-1.5-13B (85.9) and InstructBLIP-13B (78.9); LLaVA-Plus reaches 1203 Elo on VisIT-Bench, outperforming base LLaVA (1095) by 108 points | SPHINX (2023), LLaVA-Plus (2023), A Practitioner's Guide to Continual... (2024) |
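
The contrastive alignment objective in the first row of the table above (pull matching pairs together, push non-matching pairs apart) is commonly written as a symmetric cross-entropy over temperature-scaled cosine similarities. The sketch below is a minimal fixed-temperature version of that idea, assuming row `i` of the image batch matches row `i` of the text batch; it omits the dynamic temperature and density-aware margin schedules that the cited papers add, and all names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def _normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits: Matrix) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb: Matrix, text_emb: Matrix, temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy batch: matched pairs give a much lower loss than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))
print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))
```

A lower temperature sharpens the softmax, so hard negatives dominate the gradient; schedules that vary it during training are one of the refinements the table refers to.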
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| nuScenes Zero-Shot Recognition | Accuracy | 37.8% | CLIP2 (2023) |
| CheXpert Linear Probing | Macro-AUC | 0.8588 | VIVID-Med (2026) |
| ModelNet40 (Self-supervised 3D Classification) | Accuracy | 92.4% | MM-Point (2024) |
| MM-AlignBench | Win Rate | 28.5% | OmniAlign-V (2025) |
| VisIT-Bench | Elo Rating | 1203 Elo | LLaVA-Plus (2023) |
Known Limitations (4)
- Modality degradation: Visual instruction tuning significantly degrades the language backbone's original text capabilities, requiring careful alignment strategies to mitigate this inherent tension between visual and linguistic learning objectives (affects: Multimodal Preference Optimization, Multi-Modal Architecture Mixing)
  Potential fix: Applying lightweight DPO with distilled preferences from stronger models, or using data filtering strategies like Conditional Verdict Shift (CVS) to select only samples requiring genuine visual reasoning
- Data scarcity in specialized domains: Medical, 3D, and scientific domains have severely limited paired multi-modal data, constraining pretraining effectiveness and potentially introducing domain biases from small sample sizes (affects: Domain-Specialized Vision-Language Pretraining, 3D-Language-Image Pretraining)
  Potential fix: Using elite knowledge sparks from small high-quality datasets, automatic proxy mining from unlabeled scans, and LLM-generated structured supervision to bootstrap pretraining without manual annotation
- Demographic and social bias in pretrained encoders: CLIP and similar models encode demographic biases in specific attention heads, which silently propagate to all downstream applications built on these foundations (affects: Contrastive Multi-Modal Alignment)
  Potential fix: Mechanistic fairness audits to identify and surgically ablate specific bias-encoding attention heads, reducing gender bias (Cramér's V from 0.381 to 0.362) while preserving or improving accuracy (+0.42%)
- Hallucination and over-optimization: Models may hallucinate visual content not present in the image or over-optimize toward proxy rewards, causing quality degradation beyond certain reward thresholds (reward hacking) (affects: Multimodal Preference Optimization, Multi-Modal Architecture Mixing)
  Potential fix: Regulated clipping with ratio normalization and gradient balancing to prevent reward hacking; visual self-fulfilling alignment that activates safety personas through exposure to threat-related imagery without explicit safety labels
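
Since DPO recurs throughout this topic as the lightweight fix for modality degradation, a minimal sketch of the standard DPO loss for a single preference pair may help. It assumes the summed log-probabilities of the chosen and rejected responses are already available from the policy and from a frozen reference model; the function name, argument names, and the `beta` default are illustrative, and the data pipelines of the cited papers are not reproduced here.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Negative log-sigmoid of the scaled log-ratio margin for one preference pair."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss equals log(2) when the policy matches the reference, and drops
# as the policy shifts probability toward the chosen response.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because the reference log-probabilities appear only as offsets, the loss penalizes drifting away from the reference model as well as ranking the pair incorrectly.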
Major papers in this topic (10)
- On the Comparison between Multi-modal and Single-modal Contrastive Learning (2024-11) 8
- OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference (2025-02) 8
- CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data (2023-03) 8
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction (2024-02) 8
- VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs (2026-03) 8
- EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis (2024-09) 8
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (2023-11) 8
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents (2023-11) 8
- SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals (2024-05) 8
- A Practitioner's Guide to Continual Multimodal Pretraining (2024-08) 8
Moving to the next paradigm, we turn to Video Understanding.
Video Understanding
What: Research on enabling models to reason about spatiotemporal relationships, long-term dependencies, and multimodal evidence within video content.
Why: Videos are the dominant medium for information consumption, demanding AI systems that comprehend dynamic visual scenes and temporal causality.
Baseline: Standard video-language models pair a visual encoder with a decoder-only language model, processing sampled frames without explicit temporal reasoning.
- Temporal reasoning across long video sequences with complex causal and event dependencies
- Bridging heterogeneous modalities (visual, linguistic, and wireless signals) for robust human-centric understanding
Running Example
Baseline: A frame-sampling Video-LMM captions individual frames independently, missing the causal link between the ingredient addition and the color change because the two events are dozens of frames apart.
Challenge: Answering requires temporal localization (finding the exact moment of addition), causal reasoning (linking ingredient chemistry to color change), and long-context reasoning across a 10-minute video, which are exactly the challenges current models struggle with.
Overall Progress
Video understanding has evolved from dataset construction (multi-modal sensing benchmarks) through RL-driven post-training of Video-LMMs (GRPO, DPO, test-time scaling) to knowledge extraction for practical applications. The field has seen a paradigm shift from purely supervised approaches to reinforcement learning pipelines that achieve comparable or superior reasoning with orders-of-magnitude less labeled data. Concurrently, privacy-preserving sensing via wireless signals has matured from dataset creation to sophisticated graph-based pose estimation.
Sub-topics
Video Reasoning with Large Multimodal Models
2 papers
Methods that enhance Video-LMMs with post-training techniques (supervised fine-tuning, reinforcement learning, and test-time scaling) to advance from basic perception to sophisticated temporal and causal reasoning.
Video Knowledge Extraction for Downstream Applications
2 papers
Approaches that extract and repurpose the world knowledge embedded in Video-LLMs for practical applications such as video recommendation and question generation.
Non-Intrusive Multi-Modal Human Sensing
2 papers
Privacy-preserving approaches that use wireless signals (mmWave radar, WiFi, LiDAR) instead of cameras for 4D human pose estimation and action recognition.
Key Insights
- RL-only video training matches large SFT systems with 27× less data
- Iterating GRPO-verifier-DPO produces long reasoning chains 7× faster
- Multi-modal wireless sensor fusion significantly outperforms single-modality sensing
- Graph attention preserving inter-point relationships reduces radar pose error by 35%
- Multi-layer thought vectors retain visual details lost in text-based Video-LLM outputs
Timeline
Research has progressed from building foundational multi-modal datasets (2023) to RL-powered video reasoning with LMMs (2025) and graph-based radar sensing (2026), reflecting a broader trend toward scalable, privacy-aware, and reasoning-capable video understanding systems.
- MM-Fi (2023) introduced the first five-modality synchronized dataset for non-intrusive 4D human sensing, establishing benchmarks for wireless pose estimation
- VerIPO (2025) demonstrated that iterating between GRPO, verifier curation, and DPO produces long reasoning chains 7× faster than standard GRPO
- A comprehensive survey (A Survey of Video Reasoning..., 2025) unified video post-training into SFT, RL, and test-time scaling pillars, documenting that RL-only models can match large SFT-trained systems
- LinkedOut (2025) proposed extracting multi-layer thought vectors from Video-LLMs for scalable recommendation
- INQUIRER (2025) leveraged internal knowledge graphs to improve video question generation quality
Shift from pure supervised fine-tuning to RL-driven post-training (GRPO, DPO) for video reasoning, enabling models to develop long chain-of-thought capabilities with minimal labeled data.
- mmGAT (mmGAT: Pose Estimation by Graph..., 2026) applied graph attention networks with mutual edge features to radar point clouds, reducing pose estimation error by 35.6%
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Unified Video-LMM Post-Training Pipeline | Combines supervised fine-tuning, Group Relative Policy Optimization (GRPO), and test-time scaling into a unified post-training framework for Video-LMMs. | GRPO-based models (e.g., Video-RTS) match systems trained on ~165k SFT pairs using only ~6k video-QA triples; test-time scaling saturates after ~5 reasoning samples. | A Survey of Video Reasoning... (2025) |
| Verifier-Guided Iterative Policy Optimization | A three-stage loop: GRPO generates diverse rollouts, a rollout-aware verifier curates contrastive pairs, and DPO refines the policy toward longer, consistent reasoning. | Achieves 7× faster optimization than standard GRPO; outperforms Video-R1, Kimi-VL-Thinking, and Qwen2.5-VL-7B on VSI-Bench and Video-MME. | VerIPO (2025) |
| Cross-Layer Knowledge-Fusion MoE | Extracts hidden states from multiple layers of a Video-LLM backbone and uses a Mixture-of-Experts (MoE) router to dynamically select the most relevant abstraction level per video. | Replaces final-layer text output with multi-layer thought vectors, retaining fine-grained visual details lost in conventional text-based video representation. | LinkedOut (2025) |
| Multi-Modal Non-Intrusive 4D Human Sensing | Synchronizes five sensor modalities via a custom robotic platform, providing 4D spatial-temporal labels for 27 actions across 40 subjects. | Fusing LiDAR + mmWave significantly improves pose estimation over single wireless modalities; ground-truth achieves 95.66% PCKh@0.5 re-projection accuracy. | MM-Fi (2023) |
| Graph Attention Radar Pose Estimation | Models radar point clouds as directed graphs with a mutual-feature extraction block that computes pairwise attributes (velocity, distance) before graph attention. | Reduces Mean Per Joint Position Error (MPJPE) by 35.6% and PA-MPJPE by 14.1% over state-of-the-art on the mRI dataset. | mmGAT: Pose Estimation by Graph... (2026) |
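
The core of GRPO, which appears in the first two rows of the table above, is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that advantage computation, assuming scalar rewards per rollout; the function name and the `eps` constant are illustrative, and the clipped policy-gradient update and KL term of the full algorithm are omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    if not rewards:
        raise ValueError("at least one rollout reward is required")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one video question, rewarded 1.0 if the answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design and rollout diversity matter for stability.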
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| mRI Dataset (Radar Pose Estimation) | Mean Per Joint Position Error (MPJPE) | 35.6% reduction over prior state-of-the-art (absolute MPJPE not reported) | mmGAT: Pose Estimation by Graph... (2026) |
| Video-MME | Accuracy | Outperforms Video-R1, Kimi-VL-Thinking, and Qwen2.5-VL-7B (absolute score not reported) | VerIPO (2025) |
| MM-Fi Re-projection (Pose Ground Truth Quality) | PCKh@0.5 (Percentage of Correct Keypoints with head-normalized threshold) | 95.66% PCKh@0.5 | MM-Fi (2023) |
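
For readers unfamiliar with the pose metric in the table above, MPJPE is the average Euclidean distance between predicted and ground-truth joint positions, and the reported gains are relative reductions of that error. The sketch below computes both; the joint coordinates and error values are made-up toy numbers, not results from mmGAT or MM-Fi.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(predicted: Sequence[Joint], ground_truth: Sequence[Joint]) -> float:
    """Mean Per Joint Position Error: average Euclidean distance per joint."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need the same non-zero number of joints in both poses")
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which `new_error` improves on `baseline_error`."""
    return (baseline_error - new_error) / baseline_error


# Toy two-joint pose: one joint is off by 3 units, the other is exact.
pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
truth = [(0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
print(mpjpe(pred, truth))              # 1.5
print(relative_reduction(10.0, 7.5))   # 0.25
```

A relative figure alone does not reveal the absolute error, which is why the table flags that the absolute MPJPE was not reported.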
Known Limitations (4)
- Test-time scaling saturates quickly: performance gains plateau after approximately 5 reasoning samples during self-consistency voting, limiting the benefit of additional compute at inference. (affects: Unified Video-LMM Post-Training Pipeline)
  Potential fix: More sophisticated aggregation strategies beyond majority voting, or adaptive sample budgets that allocate more compute only to hard examples.
- RL training instability: GRPO-based reinforcement learning for video reasoning can produce unstable improvements in chain-of-thought quality, especially without careful reward design. (affects: Unified Video-LMM Post-Training Pipeline, Verifier-Guided Iterative Policy Optimization (VerIPO))
  Potential fix: Verifier-based curation (as in VerIPO) to filter low-quality rollouts, or multi-stage pipelines combining SFT warmup with RL fine-tuning.
- Deployment latency of Video-LLMs: Decode-only generation and large model sizes make Video-LLMs impractical for latency-sensitive applications like recommendation systems. (affects: Cross-Layer Knowledge-Fusion MoE)
  Potential fix: Extract compact hidden-state representations (thought vectors) offline and use lightweight downstream models for real-time inference.
- Wireless sensing accuracy gap: Radar and WiFi-based pose estimation still lags behind camera-based methods in precision, particularly for fine-grained hand and finger movements. (affects: Multi-Modal Non-Intrusive 4D Human Sensing, Graph Attention Radar Pose Estimation)
  Potential fix: Multi-modal fusion (e.g., LiDAR + mmWave) and graph-based architectures that preserve spatial relationships between radar points.
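
The self-consistency voting mentioned in the first limitation can be sketched as a simple majority vote over several sampled answers. The helper below is a minimal illustration with made-up answers; the `majority_vote` name is hypothetical, and the agreement fraction it returns is only one possible signal for an adaptive sample budget.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Five sampled reasoning chains ending in these final answers (toy data).
answer, agreement = majority_vote(["B", "A", "B", "C", "B"])
print(answer, agreement)  # B 0.6
```

Once most samples already agree, extra samples rarely change the winner, which is consistent with the early saturation described above.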
Major papers in this topic (6)
- A Survey of Video Reasoning with Large Multimodal Models (2025-10) 9
- MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing (2023-05) 9
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization (2025-05) 7
- LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation (2025-12) 7
- mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud (2026-03) 7
- INQUIRER: Harnessing internal knowledge graphs for video question generation (2025-08) 6
Diving deeper into Video Understanding, let's examine specific research threads that define this area.
Video QA and Captioning
What: Research on enabling AI models to answer natural-language questions about videos and generate grounded textual descriptions, requiring joint visual perception, temporal reasoning, and language generation.
Why: Videos dominate information consumption, yet AI models struggle with temporal dynamics, long-duration content, and grounding responses in specific visual evidence.
Baseline: Uniformly sample video frames, encode them with a frozen visual encoder, and concatenate visual tokens with text for a large language model to generate answers.
- Processing long or streaming videos that exceed context windows while preserving temporal coherence across distant events
- Achieving fine-grained spatiotemporal perception beyond surface-level recognition to detect subtle actions and rare moments
- Grounding textual responses in verifiable visual evidence rather than hallucinating from language priors
Running Example
Baseline: A standard Video LLM uniformly samples ~32 frames from 30 minutes, almost certainly missing the 2-second salting mistake. It generates a generic summary like 'The chef prepared a stew' without temporal grounding or causal reasoning about the error.
Challenge: This example requires: (1) long-video processing to scan 30 minutes efficiently, (2) fine-grained perception to notice the brief over-salting moment among routine actions, (3) temporal localization to pinpoint when it happened, and (4) causal reasoning to link the mistake to recovery steps.
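
A quick back-of-the-envelope calculation shows why uniform sampling "almost certainly" misses the brief event in this example. The sketch assumes evenly spaced frames and an event whose start time is equally likely anywhere relative to the sampling grid; the function names are illustrative.

```python
import math


def hit_probability(num_frames: int, video_seconds: float, event_seconds: float) -> float:
    """Chance that at least one evenly spaced frame falls inside the event."""
    spacing = video_seconds / num_frames  # seconds between sampled frames
    return min(1.0, event_seconds / spacing)


def frames_for_guaranteed_hit(video_seconds: float, event_seconds: float) -> int:
    """Smallest frame count whose spacing is no larger than the event duration."""
    return math.ceil(video_seconds / event_seconds)


p = hit_probability(num_frames=32, video_seconds=30 * 60, event_seconds=2)
print(f"hit probability: {p:.3f}")            # hit probability: 0.036
print(frames_for_guaranteed_hit(30 * 60, 2))  # 900
```

With 32 frames over 30 minutes the gap between frames is about 56 seconds, so a 2-second event is captured less than 4% of the time, and guaranteeing coverage by uniform sampling alone would take roughly 900 frames.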
Overall Progress
The field has undergone a fundamental paradigm shift from task-specific video models (2023) through general-purpose video MLLMs with memory augmentation (2024) to RL-trained reasoning systems with explicit evidence grounding (2025-2026). The dominant training paradigm evolved from supervised fine-tuning to GRPO-based reinforcement learning, with consistency-aware and difficulty-aware variants addressing reward hacking. Architecturally, the field converged on token-efficient designs that reduce compute by 5-10x while enabling processing of multi-hour videos.
Sub-topics
Reinforcement Learning for Video Reasoning
15 papers
Methods applying reinforcement learning, primarily Group Relative Policy Optimization (GRPO) and its variants, to improve video MLLMs' reasoning, perception, and captioning capabilities beyond what supervised fine-tuning achieves.
Chain-of-Thought & Structured Video Reasoning
8 papers
Approaches that decompose video question answering into explicit multi-step reasoning chains, including visual chain-of-thought, tool-augmented reasoning, and interleaved video-text reasoning paradigms.
Long & Streaming Video Understanding
10 papers
Methods for processing videos ranging from minutes to hours (or continuous streams) by using memory banks, hierarchical representations, agentic search, and online processing to overcome context window limitations.
Efficient Video-Language Architectures
8 papers
Architectural innovations that reduce the computational cost of video LLMs through token compression, codec-aware encoding, encoder-free designs, two-stream projectors, and parameter-space visual alignment.
Video Understanding Benchmarks & Evaluation
6 papers
Benchmark datasets and evaluation frameworks that assess video MLLMs across temporal reasoning, chain-of-thought quality, long-video comprehension, and complex multi-step inference.
Domain-Specific Video Applications
12 papers
Specialized video QA and captioning systems tailored to specific domains including autonomous driving, accident analysis, egocentric multi-agent collaboration, advertisement understanding, spatial reasoning, and temporal grounding.
Key Insights
- Reinforcement learning with verifiable rewards outperforms supervised fine-tuning by 10-15% on video reasoning tasks.
- Visual perception, not logical reasoning, is the primary bottleneck in video chain-of-thought models.
- Token-efficient encoding reduces video LLM compute by 5-10x while maintaining or improving accuracy.
- Memory-augmented streaming enables constant-cost processing of arbitrarily long videos.
- Surprise-weighted frame sampling consistently outperforms uniform sampling across diverse benchmarks.
Timeline
Research has shifted from 'how to encode video for LLMs' toward 'how to reason about video with verifiable evidence.' The 2025 RL revolution made GRPO the de facto post-training method, while agentic tool-use and streaming architectures extended practical video understanding from minutes to days.
- mPLUG-2 (mPLUG-2: A Modularized Multi-modal Foundation..., 2023) introduced modularized multi-modal pre-training with shared universal layers, achieving SOTA on MSRVTT Video QA (48.0%) and Captioning (80.3 CIDEr)
- MM-AU (2023) introduced tone transition tracking as a formal task for understanding condensed ad narratives across 8.4K multilingual videos
- MVBench (2023) established 20 temporal video understanding tasks, with its VideoChat2 baseline surpassing GPT-4V by 7.6%
- MM-VID (2023) pioneered the video-to-script generation pipeline for processing hour-long content through GPT-4V
- MM-Narrator (2023) introduced memory-augmented recurrent generation for audio descriptions spanning hours of video
Transition from task-specific video models to general-purpose multi-modal LLMs capable of open-ended video conversation.
- Video-MME (2024) created the first full-spectrum video evaluation benchmark across durations and modalities, becoming the de facto standard
- MA-LMM (2024) introduced online memory-bank processing for constant-cost long video understanding, achieving 60.7% on LVU
- Video-of-Thought (2024) bridged perception and cognition with scene-graph grounded reasoning chains
- BEV-InMLLM (Holistic Autonomous Driving Understanding, 2024) injected Bird's-Eye-View features into MLLMs for holistic autonomous driving understanding across 91K multi-view QA pairs
- AIM (2024) achieved 6.8x FLOPs reduction via training-free token merging and PageRank-based pruning
- LLaVA-Hound-DPO (Aligning Large Multimodal Models with..., 2024) demonstrated caption-based proxy rewards for scalable video RLHF at <$20 for 120K pairs
- GRPO-CARE (2025) introduced consistency-aware rewards that improved reasoning quality by +24.5% over standard GRPO
- V-JEPA 2 (2025) scaled self-supervised latent video prediction to 1M+ hours, achieving 77.3% on Something-Something v2 and enabling robotic planning
- Open-o3-Video (2025) introduced curriculum RL for joint spatio-temporal grounding with explicit evidence tags
- Deep Video Discovery (2025) reframed video understanding as iterative agentic search, achieving a state-of-the-art 74.2% on LVBench
- Seed1.5-VL (Seed1.5-VL Technical Report, 2025) achieved SOTA on 38 of 60 public benchmarks using dynamic frame-resolution sampling and hybrid RL
- CoPE-VideoLM (2026) leveraged video codec structure to enable 8 hours of video within 1M tokens, with an 86.2% reduction in time to first token (TTFT)
- A 2026 study (Video-Based, 2026) treated agent evaluation as video understanding with spatiotemporal token pruning, achieving 84.7% accuracy and surpassing GPT-5.2
- Think While Watching (2026) decoupled visual input from text output for concurrent streaming perception and generation with 92.6% latency reduction
π Shift from supervised fine-tuning to reinforcement learning with verifiable rewards as the dominant post-training paradigm for video MLLMs, with GRPO variants appearing in over 15 papers.
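To make "group-relative" concrete: GRPO samples several responses to the same prompt, scores each with a verifiable reward, and normalizes each reward against the mean and standard deviation of its own group rather than against a learned value function. The sketch below is a minimal illustration under that description; the function name, the binary rewards, and the `eps` constant are assumptions for this example and are not taken from any of the cited papers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # spread of rewards inside the group
    # eps (an assumed constant) avoids division by zero when all rewards tie.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one video question, 1.0 = correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every answer is wrong carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0])
```

Correct answers receive positive advantages and incorrect ones negative, which is why reward design matters so much: whatever the reward checks is what the group comparison amplifies.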
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| GRPO-Based Video Reinforcement Learning | Train video models with group-relative reward signals (comparing outputs within a sampled group of responses) to learn robust reasoning without expensive human annotations. | Improves on standard GRPO by +6.7% accuracy on SEED-Bench-R1 Level-3 (GRPO-CARE), and on SFT baselines by +15.6% UAR on DFEW (R1-Omni), achieving state-of-the-art video reasoning. | GRPO-CARE (2025), Open-o3-Video (2025), Video-STR (2025), R1-Omni (2025), Rethinking Chain-of-Thought Reasoning for Videos (2025) |
| Chain-of-Thought Video Reasoning | Treat selected video frames as 'visual thoughts' analogous to textual chain-of-thought, curating visual context iteratively before generating the final answer. | Temporal CoT improves on standard inference by +11.4 points on LVBench (avg 68-min videos) using the same 32K token budget, achieving state-of-the-art on 4 benchmarks. | Temporal Chain of Thought: Long-Video... (2025), Video-of-Thought (2024), Video-CoT (2025), Thinking With Videos (2025) |
| Memory-Augmented Long Video Understanding | Decouple video perception from language generation using persistent memory banks that compress, store, and retrieve temporal context on demand. | MA-LMM improves on S5 baseline by +3.8% on LVU benchmark, achieving 60.7% accuracy. Think While Watching reduces time-to-first-token by 92.6% while matching offline accuracy. | MA-LMM (2024), Think While Watching (2026), Deep Video Discovery (2025), Ego-R1 (2025) |
| Token-Efficient Video Encoding | Exploit the massive redundancy in video frames by merging similar tokens, encoding only visual changes, or transforming video features into lightweight weight updates. | AIM reduces FLOPs by 6.8x over LLaVA-OV-7B with +4.6 points on MLVU when using efficiency gains for more frames. CoPE reduces time-to-first-token by 86.2% over LLaVA-Video-7B. | AIM (2024), CoPE-VideoLM (2026), ViPE (2025), SlowFast-LLaVA-1.5 (2025) |
| Agentic Video Search & Tool Use | Empower an LLM agent with modular video tools and train it via reinforcement learning to plan optimal tool-use sequences for complex queries. | DVD achieves 74.2% accuracy on LVBench, setting a new state-of-the-art and surpassing all prior works by a large margin. VITAL improves by +11.4% on LongVideo-Reason over the previous best open-source model. | Deep Video Discovery (2025), Ego-R1 (2025), Thinking With Videos (2025), RAVEN (2025) |
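The "constant-cost" property in the Memory-Augmented row can be illustrated with a fixed-capacity memory bank: when a new frame feature would overflow the bank, the two most similar temporally adjacent entries are averaged into one. This is a minimal pure-Python sketch of that idea; the capacity, the toy 2-D features, and the helper names are illustrative assumptions, not MA-LMM's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def append_with_compression(bank, feature, capacity):
    """Append a frame feature; if over capacity, merge the most similar adjacent pair."""
    bank = bank + [feature]
    if len(bank) <= capacity:
        return bank
    # Locate the temporally adjacent pair with the highest similarity.
    i = max(range(len(bank) - 1), key=lambda k: cosine(bank[k], bank[k + 1]))
    merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
    return bank[:i] + [merged] + bank[i + 2:]


# Hypothetical stream of frame features; the bank never grows past 3 entries.
bank = []
for feat in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8], [1.0, 1.0]]:
    bank = append_with_compression(bank, feat, capacity=3)
```

Because redundant neighbours are merged first, distinctive moments tend to survive longer than repeated ones, although aggressive capacity limits can still average away rare events.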
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| Video-MME | Accuracy (%) | 81.3% | Video-MME (2024) |
| MVBench | Average Accuracy (%) | 51.1% | MVBench (2023) |
| LVBench | Accuracy (%) | 74.2% | Deep Video Discovery (2025) |
| QVHighlights | mAP (IoU=0.5) | +2.8% mAP over TRACE | TimeExpert (2025) |
| SEED-Bench-R1 | Accuracy (%) | +6.7% on Level-3 over standard GRPO | GRPO-CARE (2025) |
Known Limitations (4)
- RL training instability and high compute cost: GRPO variants require careful reward design and significant GPU resources, with reward hacking remaining a persistent risk where models find shortcut solutions. (affects: GRPO-Based Video Reinforcement Learning, Chain-of-Thought Video Reasoning)
Potential fix: GRPO-CARE's consistency-aware rewards and FaVChat's data-efficient DE-GRPO demonstrate that adaptive reward mechanisms and sample utility estimation can mitigate instability and reduce data requirements.
- Hallucination in video descriptions: Models generate plausible but fabricated details not present in the video, with faithfulness scores as low as 34% before mitigation, due to over-reliance on language priors. (affects: Chain-of-Thought Video Reasoning, Memory-Augmented Long Video Understanding)
Potential fix: Dynamic ad-hoc RAG for cross-verification (ResNetVLLM-2) and caption-based proxy rewards for DPO alignment (LLaVA-Hound-DPO) improve faithfulness from 34% to 98% in controlled settings.
- Context window limits for very long videos: Even with compression, multi-hour or multi-day videos exceed model capacity, and aggressive compression risks losing rare but critical events. (affects: Memory-Augmented Long Video Understanding, Token-Efficient Video Encoding)
Potential fix: Hierarchical RAG with tool-based retrieval (Ego-R1) and codec-primitive encoding (CoPE) extend coverage to full weeks and 8 hours respectively, though reliability at scale remains unproven.
- Benchmark saturation and evaluation gaps: Models achieve high accuracy on standard MCQ benchmarks through text-based elimination without genuine visual understanding, as shown by significant credibility gaps when grounding is required. (affects: GRPO-Based Video Reinforcement Learning, Chain-of-Thought Video Reasoning)
Potential fix: CG-Bench's clue-grounded evaluation and VCR-Bench's stepwise process scoring offer more rigorous evaluation, but widespread adoption of process-centric metrics is needed.
Major papers in this topic (10)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06) 9
- Video-Based Reward Modeling for Computer-Use Agents (2026-03) 9
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning (2025-06) 8
- Seed1.5-VL Technical Report (2025-05) 8
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding (2025-05) 8
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2023-11) 8
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (2024-04) 8
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (2023-02) 8
- Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence (2025-10) 8
Within the same paradigm, another important research direction focuses on Temporal Reasoning and Action.
Temporal Reasoning and Action
What: Research on enabling multimodal models to understand temporal dynamics in video, including event ordering, moment localization, action recognition, and causal reasoning over time.
Why: Accurate temporal reasoning is essential for video-based AI applications like embodied agents, autonomous navigation, and interactive video assistants.
Baseline: Standard Video LLMs uniformly sample frames and use next-token prediction, treating video as a bag of static images without explicit temporal modeling.
Challenges:
- Models exploit static frame shortcuts instead of reasoning about event progression and temporal ordering
- Existing benchmarks contain noisy annotations and ambiguous queries that mask true temporal understanding gaps
- Long videos overwhelm context windows, causing models to miss brief or rare temporal events
Running Example
Baseline: A standard Video LLM uniformly samples 32 frames from the 10-minute video, likely missing the 2-second salt-adding moment entirely. It guesses 'the chef adds salt around the middle' based on common cooking scripts rather than visual evidence.
Challenge: This example requires temporal localization (pinpointing a 2-second window among 10 minutes), fine-grained perception (distinguishing the brief salt-adding gesture from similar hand movements), and causal reasoning (understanding what follows, stirring the pasta, as a consequence).
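The baseline's failure here is largely arithmetic. With 32 uniformly spaced frames over 600 seconds, consecutive samples are 18.75 seconds apart, so a 2-second event placed at random contains a sampled frame only about one time in nine. The sketch below checks these numbers; it assumes frames are taken at the centers of equal-length segments (one common convention, actual samplers vary) and ignores edge effects at the start and end of the video.

```python
def sample_times(duration_s, num_frames):
    """Timestamps at the centers of num_frames equal-length segments."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def hit_probability(duration_s, num_frames, event_s):
    """Approximate chance that a randomly placed event window contains a sample.

    Reasonable when the event is shorter than the spacing between frames.
    """
    step = duration_s / num_frames
    return min(1.0, event_s / step)


times = sample_times(600, 32)
spacing = times[1] - times[0]          # 18.75 s between sampled frames
p_hit = hit_probability(600, 32, 2.0)  # roughly 0.107
```

Under these assumptions the salt-adding moment is missed in roughly nine out of ten videos, which is why the model falls back on common cooking scripts.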
Overall Progress
The field progressed from static benchmark evaluation (2023) through structured reasoning frameworks (2024) to a reinforcement learning revolution (2025) where temporal-aware reward signals became the dominant paradigm. The key paradigm shift was recognizing that standard next-token prediction fundamentally fails to capture temporal dynamics, leading to contrastive RL methods that explicitly penalize static frame exploitation. Concurrently, the community addressed data quality issues, revealing that 20-35% of popular benchmark annotations are flawed.
Sub-topics
Reinforcement Learning for Temporal Reasoning
6 papers
Methods that modify reinforcement learning algorithms (especially GRPO) with temporal-aware rewards to train video models that genuinely understand event progression rather than exploiting static visual shortcuts.
Chain-of-Thought Video Reasoning
5 papers
Approaches that decompose video question-answering into structured multi-step reasoning processes, using intermediate visual or symbolic representations to ground each reasoning step in specific video evidence.
Temporal Grounding and Localization
3 papers
Specialized architectures and training recipes for precisely locating event boundaries in videos, including moment retrieval, highlight detection, and dense video captioning with timestamps.
Egocentric Activity Understanding
4 papers
Research on understanding activities from first-person viewpoints, requiring inference of the camera wearer's hidden intentions, hand-object interactions, and spatial navigation through dynamic environments.
Benchmarks, Datasets, and Foundation Models
7 papers
Evaluation frameworks, large-scale datasets, and unified foundation models that establish standards and baselines for measuring temporal video understanding capabilities.
Key Insights
- Contrastive temporal rewards prevent static frame shortcut exploitation in video reasoning
- Selective frame curation with 32K tokens outperforms 700K-token brute-force processing
- Popular temporal benchmarks contain 20-35% flawed annotations, distorting evaluations
- Decoupling temporal localization from text generation yields 20%+ grounding improvements
- Tool-augmented active video clipping reduces hallucination in long-video understanding
Timeline
Research evolved from building temporal benchmarks and datasets (2023-2024) to developing RL-based training paradigms that enforce genuine temporal understanding (2025), with an increasing emphasis on grounded, verifiable reasoning with explicit spatio-temporal evidence.
- AssemblyHands (CVPR 2023) established the largest egocentric 3D hand pose benchmark with 3M annotated images using multi-view annotation to overcome occlusion challenges
- MVBench (CVPR 2024) introduced 20 systematic temporal video understanding tasks, revealing that existing MLLMs including GPT-4V scored below 50% on temporal reasoning
- CG-Bench exposed the 'credibility gap' in long-video benchmarks, showing model accuracy drops from ~53% to ~21% when requiring clue-grounded evidence rather than multiple-choice elimination
- Video-of-Thought (ICML 2024) pioneered step-by-step video reasoning using spatial-temporal scene graphs as intermediate rationales, bridging perception and cognition
- MM-WLAuslan (NeurIPS 2024) curated the first large-scale Australian Sign Language dataset with 282K+ multi-view videos for temporal action recognition
- Video-R1 (2025-03) pioneered Temporal GRPO with contrastive temporal rewards, establishing the first systematic RL approach for video temporal reasoning
- TEMPLE (2025-03) reversed the standard training order by applying preference learning before instruction tuning to establish fundamental temporal alignment
- Temporal Chain of Thought (2025-07) demonstrated that self-reflective frame selection with 32K tokens outperforms brute-force 700K-token context windows
- VITAL (2025-08) introduced tool-augmented reasoning where models actively clip and re-examine video segments during their reasoning chain
- Video-STR (2025-10) extended RL with graph-based verifiable rewards for precise spatio-temporal object relation modeling
- D2VLM (2025-11) factorized temporal grounding into evidence finding and text generation stages, achieving +21.6% F1 improvement on grounding benchmarks
- TimeLens (2025-12) exposed 20-35% annotation quality issues in popular temporal grounding benchmarks and proposed curated re-annotation with RLVR training
The field shifted from supervised fine-tuning to RL-based temporal reasoning, with T-GRPO and its variants becoming the dominant training paradigm for enforcing genuine temporal understanding in video models.
- Human-AI (2026-03) revealed that AI models degrade more gradually than humans on spatial reduction but show class-dependent sensitivity to temporal scrambling, establishing new metrics for measuring temporal robustness gaps
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Temporal Reinforcement Learning | Contrastive temporal rewards compare model accuracy on ordered versus shuffled frames, penalizing reliance on static visual content. | Video-R1 achieves 37.1% accuracy on VSI-Bench, outperforming GPT-4o; Video-STR improves on Qwen2.5-VL-7B by +13% on STI-Bench, surpassing GPT-4o on spatio-temporal reasoning | Video-R1 (2025), Video-STR (2025), Open-o3-Video (2025), VideoPerceiver (2025) |
| Visual Chain-of-Thought Reasoning | Selected video frames serve as visual thoughts, enabling focused reasoning on relevant evidence rather than processing entire videos. | Temporal CoT improves by +11.4 points on LVBench (avg 68-min videos) vs standard inference with the same 32K token budget; CoTasks achieves +34.3% accuracy on STAR benchmark for Qwen2.5-VL-3B | Video-of-Thought (2024), Temporal Chain of Thought: Long-Video... (2025), Video-CoT (2025), SG-VLM (2025) |
| Factorized Temporal Grounding | Decoupling temporal boundary prediction from text generation allows each subtask to be optimized independently with task-specific mechanisms. | D2VLM achieves +21.6% average F1 on E.T. Bench Grounding over E.T.Chat-3.8B (60.2% F1); TimeExpert achieves +2.8% mAP (IoU=0.5) on QVHighlights over TRACE; TimeLens-8B surpasses GPT-5 on TimeLens-Bench | TimeExpert (2025), Factorized Learning for Temporally Grounded... (2025), TimeLens (2025) |
| Egocentric Spatio-Temporal Reasoning | Reverse thinking (mentally retracing a route backwards) mimics human cognitive processes for spatial recall from egocentric perspectives. | EgoThinker achieves state-of-the-art on EgoTimeQA and Ego-QA benchmarks; AssemblyHands MVExoNet achieves 4.20mm keypoint error, an 85% error reduction from Assembly101's 27.55mm | AssemblyHands (2023), ST-Think (2025), EgoThinker (2025) |
| Tool-Augmented Video Reasoning | Models 'think with videos' by iteratively clipping and re-examining relevant segments, enabling active visual evidence gathering during reasoning. | VITAL achieves +11.4% accuracy on LongVideo-Reason (79.3% vs 67.9% previous best open-source); +7.3% Recall@1 on VidChapters-7M temporal grounding (34.7% vs 27.4%) | Thinking With Videos (2025) |
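The contrastive idea in the Temporal Reinforcement Learning row can be sketched as follows: answer the same question once with frames in order and once with frames shuffled, then grant a bonus to correct ordered-frame answers only when the ordered group is clearly more accurate than the shuffled one. The bonus size `alpha`, the margin `mu`, and the function name are illustrative assumptions; the exact reward used by Video-R1 may differ.

```python
def temporal_bonus(ordered_correct, shuffled_correct, alpha=0.3, mu=0.8):
    """Per-response bonus for the ordered-frame group.

    ordered_correct / shuffled_correct: lists of booleans, one per sampled
    response, for the same question with ordered vs. shuffled frames.
    alpha and mu are assumed hyperparameters for this sketch.
    """
    p_ordered = sum(ordered_correct) / len(ordered_correct)
    p_shuffled = sum(shuffled_correct) / len(shuffled_correct)
    # Frame order counts as useful only if it beats the scaled shuffled accuracy.
    order_helped = p_ordered > mu * p_shuffled
    return [alpha if (ok and order_helped) else 0.0 for ok in ordered_correct]


# Order matters: 3 of 4 responses are correct when ordered, 1 of 4 when shuffled.
bonus = temporal_bonus([True, True, False, True], [False, True, False, False])
# Static shortcut: shuffled frames do at least as well, so no bonus is granted.
no_bonus = temporal_bonus([True, False, False, False], [True, True, True, False])
```

A model that answers from a single salient frame scores the same either way and earns nothing extra, which is the behaviour the contrastive reward is meant to discourage.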
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| STI-Bench | Accuracy | Surpasses GPT-4o (exact score not reported) | Video-STR (2025) |
| E.T. Bench Grounding | Average F1 | 60.2% F1 | Factorized Learning for Temporally Grounded... (2025) |
| QVHighlights | mAP (IoU=0.5) | +2.8% mAP over TRACE (absolute score not specified) | TimeExpert (2025) |
| LVBench | Accuracy | +11.4 points over standard inference baseline | Temporal Chain of Thought: Long-Video... (2025) |
| NExT-QA (Temporal/Causal) | Accuracy | +23.6% over ViperGPT baseline with InternVL-14B | SG-VLM (2025) |
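Grounding metrics in this table, such as mAP at IoU=0.5 and Recall@1, rest on the temporal IoU between a predicted segment and an annotated one. A minimal sketch, assuming segments are `(start, end)` pairs in seconds; the helper names and example segments are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top_preds, gts, threshold=0.5):
    """Fraction of queries whose top-ranked prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top_preds, gts))
    return hits / len(gts)


# 2 s of overlap over a 6 s union: below the 0.5 threshold.
iou = temporal_iou((10.0, 14.0), (12.0, 16.0))
# One miss and one hit across two hypothetical queries.
r1 = recall_at_1([(10.0, 14.0), (30.0, 40.0)], [(12.0, 16.0), (31.0, 39.0)])
```

Because the score depends directly on annotated boundaries, inaccurate timestamps in a benchmark shift results even when the model's prediction is reasonable.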
Known Limitations (4)
- Static frame exploitation: models can achieve high accuracy on many temporal benchmarks by reasoning from individual frames rather than understanding event progression, undermining the validity of temporal evaluations (affects: Temporal Reinforcement Learning (T-GRPO), Visual Chain-of-Thought Reasoning)
Potential fix: Contrastive temporal rewards (T-GRPO) and temporal preference alignment (TEMPLE) explicitly penalize frame-order-invariant answers, but require careful reward calibration
- Benchmark annotation quality: 20-35% of samples in popular temporal grounding benchmarks have ambiguous queries or inaccurate timestamps, causing misleading model comparisons and rewarding shortcut learning (affects: Factorized Temporal Grounding, Temporal Reinforcement Learning (T-GRPO))
Potential fix: Manual re-annotation (TimeLens-Bench) and clue-grounded evaluation (CG-Bench) provide higher-quality assessments but are expensive to scale
- Long video scalability: context window limitations force uniform frame sampling that misses brief or rare events in videos exceeding 10 minutes, with performance degrading significantly on hour-long content (affects: Visual Chain-of-Thought Reasoning, Tool-Augmented Video Reasoning)
Potential fix: Dynamic segment processing (Temporal CoT) and tool-augmented clipping (VITAL) decouple video length from context limits, but add inference-time computation
- Egocentric domain gap: first-person video understanding requires inferring unobservable agent intentions and handling severe hand-object occlusions, for which standard third-person training data provides inadequate supervision (affects: Egocentric Spatio-Temporal Reasoning)
Potential fix: Large-scale egocentric datasets (EgoRe-5M) and multi-view exocentric annotation pipelines (AssemblyHands) help bridge the gap, but collecting diverse egocentric data remains challenging
Major papers in this topic (10)
- Video-R1: Reinforcing Video Reasoning in MLLMs (2025-03) 8
- Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames (2025-07) 8
- TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs (2025-12) 8
- Factorized Learning for Temporally Grounded Video-Language Models (2025-11) 8
- Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph (2025-10) 8
- Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning (2025-08) 8
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2023-11) 8
- CG-Bench: Clue-Grounded Question Answering Benchmark for Long Video Understanding (2024-01) 8
- AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation (2023-04) 8
- MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset (2024-10) 9
Moving to the next paradigm, we turn to Embodied AI and Robotics.
Embodied AI and Robotics
What: Research on AI systems that perceive, reason, and act in physical or simulated environments through vision, language, and motor control.
Why: Enabling robots and agents to autonomously perform complex real-world tasks requires closing the loop between perception, reasoning, and physical action.
Baseline: Traditional systems decouple perception, planning, and control into separate hand-engineered modules with task-specific training on each component.
Challenges:
- Bridging the sim-to-real gap: policies trained in simulation degrade under real-world noise, dynamics, and visual diversity
- Long-horizon reasoning under partial observability: agents must plan over extended sequences with incomplete sensory information
- Scaling robot learning: collecting diverse, high-quality demonstration data is expensive and limits generalization
Running Example
Baseline: A traditional modular pipeline would use a pre-built map for navigation, a fixed object detector for the bottle, and a scripted grasp routine. It would fail if the map is outdated, the bottle looks different than training examples, or obstacles block the planned path.
Challenge: This task requires multi-floor navigation (long-horizon planning), recognizing an object from a language description under visual clutter (vision-language grounding), adapting to unexpected obstacles like a closed door (dynamic replanning), and safely grasping a small object (dexterous manipulation).
Overall Progress
The field has evolved from modular perception-planning-control pipelines to end-to-end foundation models that learn directly from raw sensory inputs. A critical paradigm shift occurred with the introduction of self-improving training loops, where robots generate their own reward signals for autonomous practice, reducing dependence on expensive human demonstrations. Most recently, the community has shifted focus toward safety-critical evaluation, revealing that even frontier MLLMs suffer from 'causal blindness' when assessing physical consequences in embodied settings.
Sub-topics
Vision-Language Navigation
12 papers
Methods enabling agents to follow natural language instructions to navigate continuous or discrete environments, including topological planning, affordance-based path selection, and map-guided prompting.
GUI and Device Control Agents
9 papers
Autonomous agents that interact with graphical user interfaces on smartphones and desktops via vision-based understanding of screenshots, using VLMs for planning and specialized tools for precise element localization.
Vision-Based Agile Locomotion and Flight
6 papers
End-to-end policies mapping raw visual inputs directly to motor commands for agile quadrotor flight, quadruped parkour, and legged robot soccer, typically using model-based RL or privileged distillation.
Robot Learning and Manipulation
8 papers
Approaches for learning generalizable manipulation skills through self-improvement, simulation data generation, tool-use transfer from human videos, and few-shot augmentation for dexterous tasks.
3D Scene Understanding and Spatial Reasoning
12 papers
Building semantic 3D representations for embodied agents, including open-vocabulary scene graphs, reasoning segmentation, multi-frame spatial reasoning, and language-driven 3D scene generation.
Embodied Reasoning, Safety, and Evaluation
14 papers
Benchmarks and methods evaluating embodied agents on physical reasoning, safety-critical decision-making, long-horizon scene prediction, and multi-modal comprehension in diverse environments.
Specialized Robotic Systems and Sensors
12 papers
Domain-specific robotic platforms and sensing technologies including surgical tactile sensors, egocentric AR data platforms, medical navigation, agricultural localization, and space perception.
Key Insights
- Self-generated rewards enable robots to surpass imitation learning with 80% less human data.
- Decoupling slow reasoning from fast control achieves real-time 30Hz embodied navigation.
- Frontier MLLMs exhibit causal blindness, failing to foresee physical consequences in 30-92% of cases.
- Pure RL-trained reasoning segmentation outperforms supervised approaches with 100x less data.
- World-model imagination enables zero-shot sim-to-real transfer for agile robotic control.
Timeline
Research has progressed from building foundational perception tools (2023) through scaling autonomous RL-based agents (2024), to self-improving models and dual-system architectures (2025), and now focuses on safety-critical evaluation and consequence-aware alignment for real-world deployment (2026).
- ETPNav (2023) introduced online topological graph construction with ghost-node prediction, winning the CVPR 2022 RxR-Habitat Challenge
- Project Aria (2023) released a comprehensive egocentric multi-sensor hardware platform with Machine Perception Services for AR research
- LISA (2023) pioneered the embedding-as-mask paradigm, enabling segmentation from complex implicit queries via LLM reasoning
- ConceptGraphs (2023) replaced dense feature clouds with structured object-centric 3D graphs for open-vocabulary planning
- (GPT-4V, 2023) demonstrated zero-shot GUI navigation using GPT-4V with Set-of-Mark visual grounding
- Mobile-Agent (2024) and its successor Mobile-Agent-v2 (2024) established vision-centric autonomous mobile device agents with multi-agent collaboration
- DigiRL (2024) scaled offline-to-online RL for GUI control, achieving 67.2% success on Android-in-the-Wild with a 1.3B model that outperforms the 18B CogAgent
- GOAT-Bench (2024) introduced multi-modal lifelong navigation with sequential subtasks testing persistent memory
- SoloParkour (2024) and the vision-based robot soccer work (Learning Robot Soccer from Egocentric Vision, 2024) achieved agile real-world locomotion from raw visual inputs
- GenSim2 (2024) leveraged reasoning LLMs and multi-modal feedback for scalable simulation task generation, improving real-world success by +21.2%
Shift from static supervised training to autonomous online RL for embodied agents, demonstrated by DigiRL's 49.5-point absolute improvement over supervised fine-tuning (67.2% vs 17.7%) on real-world device control.
- Self-Improving Embodied Foundation Models (2025) introduced steps-to-go prediction as a self-generated reward, boosting success from 45% to 75% with 10% additional autonomous practice
- Dream to Fly (2025) achieved the first autonomous pixel-to-command drone flight using world-model RL without intermediate representations
- DualVLN (Ground Slow, Move Fast, 2025) proposed the first asynchronous dual-system VLN model achieving real-time 30Hz continuous control
- Seg-Zero (2025) demonstrated emergent reasoning segmentation via pure RL (GRPO), surpassing supervised LISA by 18% zero-shot
- Multi-SpatialMLLM (2025) equipped MLLMs with robust multi-frame spatial understanding, outperforming GPT-4o by 27 points
Emergence of self-improving foundation models that learn autonomously beyond imitation, reducing dependence on expensive human demonstrations by up to 80%.
- PhyCritic (2026) introduced self-referential critic fine-tuning for physical AI, requiring models to solve problems before evaluating others' answers
- Bi-level Expert-to-Policy Assimilation (2026) achieved +40.5% relative improvement on OSWorld-Verified by converting expert traces into reachable student trajectories
- LabShield (2026) revealed a 32% performance drop when frontier MLLMs move from text-based MCQs to visual laboratory hazard scenarios
- OOD-MMSafe (2026) introduced consequence-driven safety alignment (CASPO), reducing risk identification failure from 51% to 5.7%
- MANSION (2026) generated 1,000+ multi-floor buildings, exposing sharp performance degradation of SOTA agents on vertical navigation tasks
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| End-to-End World-Model RL for Agile Control | Train a world model in latent space from pixels and learn the policy by 'dreaming' inside it, bypassing sample-inefficient real-world interactions. | Achieves 100% gate traversal success in simulation where model-free PPO completely fails (0%), and deploys zero-shot to real drones at 1.5 m/s (Dream to Fly). SoloParkour clears obstacles 1.5x the robot's height, matching privileged teacher performance. | Dream to Fly (2025), SoloParkour (2024), Bootstrapping Reinforcement Learning with Imitation... (2024), Learning Robot Soccer from Egocentric... (2024) |
| Self-Improving Robotic Foundation Models | Use the model's own predictions (e.g., steps-to-go estimates or VLM-based evaluators) as reward signals for autonomous RL-based self-improvement. | Self-Improving Foundation Models boost real-world success from 45% to 75% with just 10% additional autonomous practice, outperforming 8x more human demonstration data (60%). DigiRL achieves 67.2% on Android-in-the-Wild, a +49.5% absolute improvement over supervised fine-tuning (17.7%). | Self-Improving (2025), DigiRL (2024), From Off-Policy to On-Policy: Enhancing... (2026) |
| Dual-System Vision-Language Navigation | Separate 'thinking slowly' (VLM-based global planning) from 'moving fast' (diffusion or heuristic local control), connected via latent queries or topological graphs. | DualVLN achieves 0.03s inference latency (30Hz real-time control) versus 0.7s+ for monolithic VLM approaches. ETPNav improves +25.99% Success Rate over RecBERT on RxR-CE and won the CVPR 2022 RxR-Habitat Challenge, doubling the second-best model's score. | Ground Slow, Move Fast: A... (2025), ETPNav (2023), MapGPT (2024) |
| Reasoning Segmentation via LLM-Grounded Perception | Introduce a special segmentation token in the LLM whose hidden embedding directly prompts a mask decoder, unifying language reasoning and pixel-level perception. | LISA-13B achieves 63.2 gIoU on ReasonSeg, outperforming the specialist model SEEM (25.6 gIoU) by +37.6 points. Seg-Zero achieves 57.5 zero-shot on ReasonSeg, surpassing prior LISA-7B by 18% using pure RL without supervised reasoning traces. | LISA (2023), Seg-Zero (2025), Active-o3 (2025) |
| Object-Centric 3D Scene Graphs for Embodied Planning | Replace dense per-point feature maps with graph-structured object nodes enriched by VLM captions and LLM-reasoned spatial relationships for scalable embodied planning. | ConceptGraphs improves +16.47 mAcc over ConceptFusion on open-vocabulary 3D segmentation and achieves 0.80 Recall@1 on complex negation queries versus 0.26 for CLIP-based retrieval. MANSION generates 1,000+ multi-floor buildings where SOTA agents show sharp performance degradation. | ConceptGraphs (2023), Scenethesis (2025), MANSION (2026) |
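The self-generated reward in the Self-Improving row can be read as progress shaping: if the model predicts how many steps remain until the goal, the drop in that prediction after an action can serve as the reward, and a small final prediction can serve as a success signal. The sketch below follows that reading only; the `preds` values stand in for hypothetical model outputs and the success threshold is an assumption, not a value from the paper.

```python
def shaped_rewards(steps_to_go):
    """Reward for each action = predicted steps-to-go before minus after it."""
    return [before - after for before, after in zip(steps_to_go, steps_to_go[1:])]


def is_success(steps_to_go, threshold=1.0):
    """Treat the rollout as successful once the final prediction is small enough."""
    return steps_to_go[-1] <= threshold


# Hypothetical predictions along one rollout (the second action loses progress).
preds = [10.0, 8.0, 9.0, 5.0, 0.5]
rewards = shaped_rewards(preds)
done = is_success(preds)
```

The rewards telescope to the total predicted progress, so the robot is credited for getting closer to the goal without any human labelling of individual steps.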
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| Android-in-the-Wild (AitW) | Task Success Rate | 67.2% | DigiRL (2024) |
| ReasonSeg | gIoU (generalized Intersection over Union) | 63.2 gIoU | LISA (2023) |
| RxR-CE (Room-across-Room Continuous Environment) | Success Rate (SR) | +25.99% SR improvement over RecBERT baseline | ETPNav (2023) |
| OSWorld-Verified | Task Success Rate | 32.13% | From Off-Policy to On-Policy: Enhancing... (2026) |
| LanguageTable (Real-World Robotic Manipulation) | Task Success Rate | ~87-88% | Self-Improving (2025) |
Known Limitations (4)
- Sim-to-real transfer gap: Policies trained in simulation often degrade significantly when deployed in real-world settings due to visual, dynamic, and physical mismatches that domain randomization alone cannot fully address. (affects: End-to-End World-Model RL for Agile Control, Self-Improving Robotic Foundation Models)
Potential fix: NeRF-based rendering for photorealistic simulation backgrounds (Robot Soccer), domain randomization combined with privileged warm-starting (SoloParkour), and geometry-focused representations like point clouds that are more transfer-friendly (GenSim2).
- Safety in physical environments: Current embodied agents lack the ability to anticipate hazardous physical consequences of their actions, which is critical for deployment in laboratories, surgical settings, and household environments. (affects: Self-Improving Robotic Foundation Models, Dual-System Vision-Language Navigation)
Potential fix: Consequence-Aware Safety Policy Optimization (CASPO) shifts alignment from intent detection to causal projection, reducing risk failure from 51% to 5.7%. LabShield proposes multi-view visual data with OSHA-standard safety taxonomies.
- Scalability of demonstration data: High-quality robotic demonstration data is expensive and time-consuming to collect, limiting the diversity and generalization of learned policies. (affects: Self-Improving Robotic Foundation Models, Object-Centric 3D Scene Graphs for Embodied Planning)
Potential fix: Self-improvement loops using steps-to-go rewards reduce data needs by 80% (Self-Improving FM). GenSim2 automates task generation via reasoning LLMs and visual feedback. Tool-as-Interface reduces collection time by 77% by learning from human videos instead of teleoperation.
- Long-horizon reasoning under partial observability: Agents struggle with multi-step tasks in partially observable environments, especially in multi-floor buildings or cluttered rooms where key information is occluded or distant. (affects: Dual-System Vision-Language Navigation, Object-Centric 3D Scene Graphs for Embodied Planning)
Potential fix: Hierarchical chain-of-thought prompting for segment-level decomposition (PM-Nav), persistent memory through topological maps (GOAT-Bench), and cost-aware search strategies that prioritize cognitive retrieval over physical exploration (ESearch-R1).
Major papers in this topic (10)
- Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight (2025-01) 9
- Self-Improving Embodied Foundation Models (2025-09) 9
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (2024-06) 9
- LISA: Reasoning Segmentation via Large Language Model (2023-08) 9
- MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks (2026-03) 9
- OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences (2026-03) 9
- LabShield: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories (2026-03) 9
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning (2023-09) 8
- Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation (2025-12) 8
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents (2025-02) 8
Diving deeper into Embodied AI and Robotics, let's examine specific research threads that define this area.
Robotic Manipulation and Control
What: Research on enabling robots to perceive, reason about, and physically manipulate objects using vision-language-action models that unify perception, language understanding, and motor control.
Why: Autonomous manipulation is essential for deploying robots in homes, factories, and unstructured environments where tasks require dexterous, adaptive physical interaction.
Baseline: Supervised fine-tuning of vision-language models on expert demonstrations to directly predict robot actions from camera images and language instructions.
- Distribution shift causes compounding errors when robots encounter states unseen during demonstration training
- Balancing high-level semantic reasoning with low-latency motor control for real-time manipulation
- Sparse reward signals make it difficult to learn precise, long-horizon manipulation behaviors from trial-and-error
Running Example
Baseline: A standard imitation learning policy maps the camera image directly to motor commands. It fails when the mug is positioned differently from training data, when nearby objects cause visual confusion, or when the mug slips during grasping; the policy cannot recover from errors not present in its training distribution.
Challenge: This task illustrates three key challenges: (1) distribution shift: the mug's exact position and surrounding clutter vary each time, (2) long-horizon execution: the robot must reach, grasp, transport, and precisely place the mug without dropping it, and (3) latency: the grasp phase demands fast reactive control while planning demands slow deliberation.
Overall Progress
The field has progressed from simple imitation learning policies to sophisticated VLA architectures that integrate perception, reasoning, and action in unified frameworks. A major paradigm shift occurred with the adoption of reinforcement learning post-training, which broke the 'imitation ceiling' and enabled policies to achieve near-perfect success rates through self-improvement. Simultaneously, dual-system architectures and inference acceleration have made real-time deployment practical, with systems now operating continuously for hours in unstructured public environments.
Sub-topics
VLA Architecture and Foundation Models
14 papers
Core architectural innovations for vision-language-action models, including backbone selection, action representation (discrete tokens vs. continuous flow matching), policy head design, and unified training recipes for generalist robot control.
Reinforcement Learning for VLA Post-Training
13 papers
Methods that use reinforcement learning, including PPO, GRPO, and offline RL, to fine-tune pre-trained VLA models beyond supervised imitation learning, addressing distribution shift, reward sparsity, and training instability.
Reasoning-Enhanced Robotic Control
12 papers
Approaches that augment VLA models with explicit chain-of-thought reasoning, visual planning, and structured decision-making to improve generalization, interpretability, and long-horizon task execution.
Vision-Language-Action Models for Autonomous Driving
7 papers
Adaptation of VLA architectures to end-to-end autonomous driving, addressing challenges of physically feasible trajectory generation, adaptive reasoning under varying scenario complexity, and causal understanding for safety-critical decisions.
Scalable Training and Efficient Deployment
10 papers
Infrastructure, data generation pipelines, sim-to-real transfer, and inference acceleration techniques that make VLA models practical for real-world deployment, including distillation, asynchronous distributed training, and autonomous data collection.
Robotic Hardware and Multi-Modal Sensing
5 papers
Innovations in physical robot design including tactile sensors, compliant grippers, and dexterous hands, as well as multi-modal perception systems that integrate vision, touch, and proprioception for contact-rich manipulation.
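To make the GRPO-style post-training mentioned under "Reinforcement Learning for VLA Post-Training" more concrete, here is a minimal sketch of the group-relative advantage computation. It assumes several rollouts of the same task are scored and each reward is normalized against its own group; the helper name and the sample rewards are illustrative assumptions, not taken from any of the cited papers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with sparse binary success rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout succeeds carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, without a learned value function. When all rollouts in a group share the same reward, every advantage is zero, so such groups contribute nothing to the update.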
Key Insights
- RL post-training breaks the imitation ceiling, enabling 99-100% manipulation success rates
- Dual-system architectures achieve 100× control speedup while preserving VLM reasoning
- Explicit reasoning traces improve VLA generalization by 28% without additional robot data
- Dense process reward models enable sample-efficient real-world RL from near-zero performance
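The dual-system insight can be pictured as a control loop in which an expensive reasoner runs rarely and a cheap policy runs every tick on the cached result. The sketch below is only a toy illustration under that assumption: `slow_reasoner` and `fast_policy` are hypothetical stand-ins for a VLM and a lightweight action head, and the 1-D "robot" is invented for the example.

```python
def slow_reasoner(observation):
    """Stand-in for an expensive VLM call; returns a cached 'plan' feature."""
    return {"target": observation["goal"]}


def fast_policy(observation, cached_plan):
    """Stand-in for a lightweight controller: step toward the cached target."""
    error = cached_plan["target"] - observation["position"]
    return max(-1.0, min(1.0, error))  # clipped velocity command


def run_dual_system(num_ticks, goal=5.0, slow_every=10):
    position, cached_plan, slow_calls = 0.0, None, 0
    for tick in range(num_ticks):
        obs = {"position": position, "goal": goal}
        if tick % slow_every == 0:  # re-query the slow system only occasionally
            cached_plan = slow_reasoner(obs)
            slow_calls += 1
        # The fast system acts on every tick using the cached plan.
        position += 0.1 * fast_policy(obs, cached_plan)
    return position, slow_calls


final_position, slow_calls = run_dual_system(num_ticks=100)
```

With `slow_every=10`, the slow system is called once per ten control ticks, which is the basic mechanism by which such designs decouple control frequency from reasoning latency.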
Timeline
Research evolved from discrete action tokenization and pure supervised learning (2023-2024) to continuous flow-matching generation and RL-enhanced self-improvement (2025), and is now converging on hierarchical fast-slow architectures with explicit 3D spatial reasoning and atomic skill decomposition for scalable, deployable robotic intelligence (2026).
- (SpiRobs, 2023) introduced bio-inspired soft manipulators with logarithmic spiral morphology, grasping objects varying by two orders of magnitude in size
- (Minsight, 2023) demonstrated a fingertip-sized vision-based tactile sensor achieving 60 Hz sensing with 0.07 N force accuracy
- (EmbodiedGPT, 2023) pioneered Chain-of-Thought pre-training for embodied agents using the EgoCOT dataset of 2M+ annotated video clips, outperforming BLIP-2 by 22.1%
- Language-Guided Skill Acquisition (Scaling Up and Distilling Down, 2023) showed LLMs can generate diverse training data with auto-verification and distill it into robust multi-task diffusion policies (+33.2% over the LLM collector)
- The VLA survey (A Survey on Vision-Language-Action Models, 2024) established a hierarchical taxonomy organizing over 50 models into distinct architectural families
- (LLaRA, 2024) introduced visuomotor instruction tuning that converts robot data into text-based conversations with self-supervised auxiliary tasks
- ECoT (Robotic Control via Embodied Chain-of-Thought Reasoning, 2024) demonstrated that explicit reasoning traces improve VLA generalization by 28%, outperforming the 55B RT-2-X with only a 7B model
- Maniwhere (Learning to Manipulate Anywhere, 2024) achieved zero-shot sim-to-real transfer across 3 hardware setups using multi-view representation learning with spatial transformers
- π₀ (π₀, 2024) introduced the flow-matching VLA paradigm, training on 10,000 hours of data across 7 robot configurations for up to 50 Hz control
- (CogACT, 2024) showed that separating cognition (VLM) from action generation (Diffusion Transformer) surpasses OpenVLA by 55% in real-world success
- RoboVLMs (What Matters in Building VLAs, 2024) established systematic design principles, finding that Policy Head formulation and post-training recipes are critical
- (Optimized Fine-Tuning, 2025) identified the optimal recipe of parallel decoding with L1 regression, achieving 97.1% on LIBERO and 26× throughput gain over autoregressive methods
- (Magma, 2025) unified spatial-temporal training across 2D and 3D domains, achieving SOTA on both UI navigation and robotic manipulation
The field shifted from end-to-end imitation learning to structured VLA architectures that separate cognition from action, with π₀ establishing flow matching as the dominant action generation paradigm.
- (VLA-RL, 2025) formulated robot manipulation as multi-turn RL conversations with a Robotic Process Reward Model, matching commercial π0-FAST performance
- (Fast-in-Slow, 2025) repurposed VLM final layers as a fast execution module, achieving 117.7 Hz control and +11% over OpenVLA in real-world tasks
- (OneTwoVLA, 2025) unified System 1/2 in a single model with autonomous mode switching, achieving +30% over flat VLA baselines on long-horizon tasks
- (SimpleVLA-RL, 2025) demonstrated that GRPO with dynamic sampling achieves 91.7% from a single demonstration, outperforming π₀
- (Self-Improving, 2025) achieved 99% simulation and 100% real-world success by training lightweight residual RL agents that correct VLA failures
- RL-100 (RL-100, 2025) achieved 100% success across 1000 real-world evaluations and 7-hour continuous operation in a public shopping mall with zero failures
- (Robo-Dopamine, 2025) trained a General Reward Model on 3,400+ hours of data enabling one-shot policy adaptation from near-zero to 95% success
- Alpamayo-R1 (Alpamayo-R1, 2025) introduced Chain of Causation reasoning for driving, achieving 35% reduction in close encounters and 45% improvement in reasoning quality
The field shifted from pure imitation learning to RL-enhanced VLAs, with multiple methods achieving 95-100% success rates and demonstrating hours-long real-world operation without human intervention.
- (VLA-Thinker, 2026) introduced thinking-with-image reasoning where perception is a dynamically invocable action, achieving 97.5% on LIBERO and tripling long-horizon success
- (GST-VLA, 2026) replaced 2D patches with 3D Gaussian spatial tokens encoding surface geometry and orientation, achieving 96.4% on LIBERO (+2.0% SOTA)
- (AtomicVLA, 2026) decomposed tasks into atomic skills with Mixture-of-Experts routing, enabling continual learning of new skills without forgetting (+21% in real-world)
- (CRAFT, 2026) introduced hybrid hard-soft compliance achieving 100% success on fragile tasks and full coverage of all 33 Feix grasp taxonomy types
- IMLE Distillation (From Flow to One Step, 2026) achieved 123.5 Hz single-step inference (14.3× speedup) via set-level distillation, enabling dynamic re-planning where slow teachers fail
- (Thousand-GPU, 2026) reduced training time from 15 hours to 22 minutes (40× speedup) using the asynchronous RL-VLA3 architecture
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Flow-Matching and Diffusion-Based Action Generation | Continuous flow matching integrates directly into VLM backbones, generating precise multi-modal action distributions without discretization artifacts. | Surpasses OpenVLA discrete tokenization by +55% success rate in real-world experiments (CogACT), and achieves 97.1% on LIBERO vs. 76.5% for standard OpenVLA (OFT). | π₀: A Vision-Language-Action Flow Model... (2024), CogACT (2024), HybridVLA (2025), Fine-Tuning Vision-Language-Action Models (2025) |
| Reinforcement Learning Post-Training for VLAs | Online RL enables VLA models to discover recovery behaviors and novel strategies never shown in human demonstrations, breaking the imitation ceiling. | PLD achieves 99% success on LIBERO vs. SFT baselines failing on recovery tasks; SimpleVLA-RL achieves 91.7% on LIBERO-Long with one demo vs. 17.1% for SFT (+74.6%). | Self-Improving (2025), SimpleVLA-RL (2025), RL-100 (2025), StARe-VLA (2025) |
| Embodied Chain-of-Thought Reasoning | Interleaving semantic reasoning with spatial grounding forces the model to 'look before acting', improving generalization without additional robot data. | ECoT improves OpenVLA by +28% absolute success rate on generalization tasks; VLA-Thinker achieves 97.5% on LIBERO vs. 91.0% for OpenVLA-OFT (+6.5%). | Robotic Control via Embodied Chain-of-Thought... (2024), Fast ECoT (2025), VLA-Thinker (2026), MolmoAct (2025) |
| Dual-System Hierarchical Control | Inspired by Kahneman's System 1/2 theory, cached semantic features from a slow VLM enable a fast policy to act at 100+ Hz without re-querying the large model. | Fast-in-Slow achieves 117.7 Hz control and outperforms OpenVLA by +11% in real-world tasks; HAMSTER improves over OpenVLA by 20% across seven generalization axes. | Fast-in-Slow (2025), HAMSTER (2025), OneTwoVLA (2025), SaiVLA-0 (2026) |
| Dense Process Reward Modeling | A general reward model trained on multi-view data predicts relative progress between states, providing policy-invariant dense rewards without altering the optimal policy. | Robo-Dopamine improves success from near-zero to 95% with only 150 rollouts (~1 hour); SARM achieves 83% on real-world T-shirt folding vs. 8% for vanilla Behavior Cloning. | Robo-Dopamine (2025), SARM (2025), VLA-RL (2025) |
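The "policy-invariant dense rewards" in the last table row are in the spirit of potential-based reward shaping, where a progress estimate Φ(s) contributes γΦ(s′) − Φ(s) on top of the sparse task reward. The sketch below illustrates that idea only; the progress values are made up, and `shaped_reward` is a hypothetical helper rather than the implementation of Robo-Dopamine or SARM.

```python
def shaped_reward(task_reward, progress_s, progress_next, gamma=0.99):
    """Sparse task reward plus a potential-based shaping term.

    progress_s and progress_next stand in for a learned progress estimate
    (the potential) at the current and next state.
    """
    return task_reward + gamma * progress_next - progress_s


# Hypothetical progress estimates along one 4-step rollout (0 = start, 1 = done).
progress = [0.0, 0.2, 0.5, 0.9, 1.0]
sparse = [0.0, 0.0, 0.0, 1.0]  # success is only rewarded on the final step

dense = [
    shaped_reward(sparse[t], progress[t], progress[t + 1], gamma=1.0)
    for t in range(len(sparse))
]
```

With `gamma=1.0` the shaping terms telescope: the summed dense reward equals the summed sparse reward plus Φ(end) − Φ(start), a constant that does not depend on the path taken. This is the intuition for why such shaping gives a signal at every step without changing which policy is optimal.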
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LIBERO | Success Rate | 99.0% | Self-Improving (2025) |
| SimplerEnv | Success Rate | 98.0% | StARe-VLA (2025) |
| Real-World Multi-Task Manipulation | Success Rate | 100.0% | RL-100 (2025) |
| CALVIN | Average Successful Task Length | 4.25 tasks | What Matters in Building Vision-Language-Action... (2024) |
Known Limitations (4)
- Simulation-to-real transfer gap: Most results are demonstrated in simulation, and policies often degrade significantly when deployed on physical hardware due to visual, dynamic, and kinematic differences. (affects: Flow-Matching Action Generation, Embodied Chain-of-Thought Reasoning, RL Post-Training for VLAs)
Potential fix: Curriculum-based domain randomization (Maniwhere) and robustness-aware regularization (RobustVLA) that penalizes sensitivity to visual and execution perturbations
- Inference latency vs. reasoning depth trade-off: Large VLA models with chain-of-thought reasoning generate outputs too slowly for real-time control, with standard ECoT requiring ~5.5 seconds per step. (affects: Embodied Chain-of-Thought Reasoning, Dual-System Hierarchical Control)
Potential fix: Temporal caching and asynchronous reasoning (Fast ECoT achieves 7.5× speedup), dual-system architectures (Fast-in-Slow at 117.7 Hz), and IMLE distillation (123.5 Hz single-step inference)
- Data scarcity and embodiment diversity: High-quality robot demonstration data is expensive to collect, and policies trained on one robot body often fail to transfer to different morphologies or end-effectors. (affects: Flow-Matching Action Generation, RL Post-Training for VLAs)
Potential fix: LLM-guided autonomous data generation (Scaling Up and Distilling Down), learning from off-domain data like human videos (ZeroWBC, HAMSTER), and self-resetting collection loops (RoboClaw)
- Safety and robustness in unstructured environments: VLA models lack formal safety guarantees and can fail unpredictably when encountering visual clutter, adversarial objects, or out-of-distribution scenarios. (affects: Flow-Matching Action Generation, Embodied Chain-of-Thought Reasoning, Dense Process Reward Modeling)
Potential fix: Subtractive visual distillation that removes clutter from inputs (CGVD improves by +34.5%), Jacobian regularization for input sensitivity (RobustVLA), and closed-loop verification with error recovery (Agentic Robot)
Major papers in this topic (10)
- Robotic Control via Embodied Chain-of-Thought Reasoning (2024-07) 9
- π₀: A Vision-Language-Action Flow Model for General Robot Control (2024-10) 8
- Self-Improving Vision-Language-Action Models with Data Generation via Residual RL (2025-10) 9
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning (2025-09) 9
- RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning (2025-10) 9
- Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation (2025-12) 9
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (2025-02) 9
- Magma: A Foundation Model for Multimodal AI Agents (2025-02) 9
- Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail (2025-10) 9
- VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning (2026-03) 8
Within the same paradigm, another important research direction focuses on Autonomous Driving.
Autonomous Driving
What: Research on enabling vehicles to perceive, reason about, and navigate complex traffic environments autonomously using multi-modal sensors and learned decision-making models.
Why: Safe and reliable self-driving requires bridging perception, reasoning, and planning in dynamic, unpredictable environments with diverse road users and rare edge cases.
Baseline: Traditional modular pipelines with separate perception, prediction, and rule-based planning components connected through hand-crafted interfaces and HD maps.
- Long-tail scenarios with rare events lack sufficient training data, causing brittle failures in safety-critical situations
- Bridging high-level semantic reasoning with physically feasible, temporally consistent trajectory generation remains difficult
- Fusing heterogeneous sensor modalities while handling calibration errors, occlusions, and adverse weather conditions
Running Example
Baseline: A traditional modular pipeline detects the static barriers via LiDAR and camera but fails to predict the workers' intent to cross, cannot reason about the oncoming overtake as a coordinated social interaction, and generates a jerky stop-and-go trajectory due to conflicting rule-based heuristics.
Challenge: This scenario is a long-tail event rarely seen in training data, requires understanding human intent and social negotiation, demands robust sensor fusion under unusual road conditions, and needs temporally smooth planning that respects vehicle dynamics.
Overall Progress
The field has undergone two major paradigm shifts: first, from modular pipelines to end-to-end learned systems (2023-2024), and then from pure imitation learning to reinforcement-learning-enhanced VLA models with structured reasoning (2025-2026). Multi-modal perception matured from basic LiDAR-camera concatenation to robust semantic fusion handling adverse conditions and missing modalities. Planning evolved from deterministic trajectory generation to probabilistic, momentum-stabilized approaches with world-model-based safety verification.
Sub-topics
Vision-Language-Action Models for Driving
7 papers
End-to-end driving architectures that combine vision-language understanding with action generation, typically refined via reinforcement learning to produce physically feasible trajectories beyond imitation learning.
End-to-End Planning and Trajectory Optimization
5 papers
Methods that replace modular planning pipelines with learned systems that directly score, generate, or refine trajectory candidates from sensor inputs, handling multi-modal driving behavior and temporal consistency.
Reasoning, Chain-of-Thought, and Interpretability
7 papers
Approaches that enhance autonomous driving with structured reasoning chains, retrieval-augmented learning, and human-feedback mechanisms to improve decision interpretability and generalization.
Multi-Modal 3D Perception and Sensor Fusion
7 papers
LiDAR-camera fusion methods for 3D object detection, semantic segmentation, occupancy prediction, and map construction that handle modality heterogeneity, field-of-view mismatches, and adverse conditions.
World Models and Trajectory Prediction
5 papers
Internal predictive models that forecast future environment states or agent trajectories, enabling safer planning through imagination-based evaluation and handling variable-length or incomplete observations.
Key Insights
- Reinforcement learning transforms VLA models from passive imitators to adaptive driving agents
- Adaptive reasoning depth saves 14% inference time by bypassing chain-of-thought in simple scenarios
- Cross-modal feature completion enables robust perception even with complete camera failure
- Trajectory momentum and probabilistic vocabularies eliminate jittery one-shot planning failures
- World models reduce real-world data needs by enabling policy training entirely in imagination
Timeline
Research has rapidly converged on VLA architectures as the dominant paradigm, with key innovations in adaptive reasoning depth (fast vs. slow thinking), data-efficient RL training, and cognitive world models that prioritize task-relevant abstraction over pixel-level reconstruction.
- MSeg3D (MSeg3D, 2023) introduced semantic-based fusion and cross-modal feature completion, achieving 81.14 mIoU on nuScenes and remaining usable even when no camera input is available
- (Multi-Modal, 2023) systematized LiDAR-camera fusion approaches into a unified taxonomy
- (DriveMLM, 2023) was among the first to align LLM outputs with standardized vehicle control states, achieving 76.1 Driving Score on CARLA Town05 Long
LLMs were first bridged to vehicle control through standardized behavioral planning states, moving beyond pure language outputs.
- VADv2 (VADv2, 2024) pioneered probabilistic planning with a 4,096-trajectory vocabulary, achieving SOTA closed-loop driving on CARLA
- (RAG-Driver, 2024) introduced retrieval-augmented in-context learning for zero-shot driving generalization without fine-tuning
- (RoboFusion, 2024) adapted the Segment Anything Model (SAM) for robust 3D detection under adverse weather, improving +6.51% mAP on corrupted benchmarks
- TOKEN (Tokenize the World into Object-level Knowledge, 2024) addressed long-tail failures by tokenizing the world into structured object-level representations, reducing collision rates by up to 100% in specific scenarios
- (PlanAgent, 2024) demonstrated the first closed-loop mid-to-mid MLLM planning agent, outperforming both rule-based and learning-based baselines on nuPlan
- (PKRD-CoT, 2024) designed structured chain-of-thought prompting that improved driving decision accuracy by 22% over standard zero-shot approaches
- (MomAD, 2025) introduced trajectory and perception momentum, reducing collision rate by 26% and improving trajectory consistency by 33% over SparseDrive
- (Generalized Trajectory Scoring, 2025) won the NAVSIM v2 Challenge with 49.4 EPDMS using a super-dense 16k trajectory scorer combined with diffusion-based generation
- (IRL-VLA, 2025) proposed a Reward World Model via Inverse RL, eliminating expensive sensor simulation for VLA training and securing 1st runner-up in the CVPR 2025 Grand Challenge
- Alpamayo-R1 (Alpamayo-R1, 2025) introduced causally grounded reasoning that uses RL to align reasoning with action, reducing close encounters by 35%
- (NoRD, 2026) proved that VLAs can drive competitively with 60% less data and zero reasoning annotations using difficulty-aware Dr. GRPO optimization
- (CoT, 2025; Reasoning in AD Survey, 2026) formalized the evolution from rule-driven to knowledge-driven autonomous driving paradigms
The field shifted from imitation-only training to RL-enhanced VLA models with adaptive reasoning depth, representing a move from data-driven to knowledge-driven autonomous driving.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Reinforcement-Learning-Enhanced VLA Driving | Uses RL reward signals from world models or physics constraints to refine VLA policies, with adaptive fast/slow reasoning for efficiency. | Improves on standard GRPO-based VLAs by +11.68% PDM score using Dr. GRPO on NAVSIM; Alpamayo-R1 achieves +12% planning accuracy and 35% reduction in close encounter rate over trajectory-only baselines. | Alpamayo-R1 (2025), IRL-VLA (2025), NoRD (2026), AdaThinkDrive (2025) |
| Chain-of-Thought Reasoning for Driving | Forces models through explicit cognitive stages (perceive, recall knowledge, reason, decide) that mimic human driving cognition for interpretability. | PKRD-CoT improves decision-making accuracy by +22% over standard zero-shot prompts in ablation studies; GPT-4 achieves 100% accuracy in mathematical reasoning tasks within the framework. | PKRD-CoT (2024), DriveCoT (2024), RAG-Driver (2024) |
| Multi-Modal Sensor Fusion for 3D Perception | Uses cross-modal semantic alignment and adaptive feature gating to combine complementary strengths of sparse LiDAR geometry with dense camera texture. | MSeg3D achieves 81.14 mIoU on nuScenes test, +1.18 over previous best 2D3DNet; RoboFusion improves +6.51% mAP on KITTI-C (corrupted) over TransFusion baseline. | MSeg3D (2023), RoboFusion (2024), Multi-Modal (2024) |
| End-to-End Trajectory Planning and Scoring | Discretizes continuous planning into large trajectory vocabularies and uses learned scoring or probabilistic sampling to select temporally consistent optimal paths. | GTRS achieves 49.4 EPDMS on NAVSIM v2 Challenge (winning entry), approaching privileged planner PDM-Closed; MomAD reduces collision rate by 26% and improves trajectory consistency by 33% over SparseDrive. | Generalized Trajectory Scoring for End-to-end... (2025), VADv2 (2024), Don't Shake the Wheel: Momentum-Aware... (2025) |
| Driving World Models | Predicts future environment states using action-conditioned generative models, allowing candidate trajectories to be evaluated in imagination before execution. | Drive-OccWorld improves occupancy forecasting by +9.5% mIoU and +5.1% VPQ over prior methods on nuScenes; Kinematics-Aware WM achieves +23.1% Mean Return over image-only world model baselines. | Driving in the Occupancy World:... (2024), Constructing the Umwelt (2025), Kinematics-Aware (2026) |
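The "End-to-End Trajectory Planning and Scoring" row describes planning as choosing from a fixed trajectory vocabulary. A minimal sketch of that selection step follows, assuming a scorer has already produced one score per candidate; the three-entry vocabulary, the waypoints, and the scores are all invented for illustration, whereas the systems in the table use thousands of entries and learned scorers.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution over candidates."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary: name -> list of (lateral, longitudinal) waypoints in metres.
vocabulary = {
    "keep_lane": [(0.0, 5.0), (0.0, 10.0)],
    "nudge_left": [(-0.5, 5.0), (-1.0, 10.0)],
    "brake": [(0.0, 2.0), (0.0, 3.0)],
}
names = list(vocabulary)
scores = [2.0, 0.5, 1.0]  # stand-in for learned scorer outputs

probs = softmax(scores)
best = names[max(range(len(names)), key=probs.__getitem__)]
chosen_trajectory = vocabulary[best]
```

Taking the highest-probability entry gives a deterministic planner; sampling from `probs` instead would give the probabilistic variant.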
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| NAVSIM v2 (Navhard) | EPDMS (Extended Predictive Driver Model Score) | 49.4 EPDMS | Generalized Trajectory Scoring for End-to-end... (2025) |
| CARLA Town05 Long | Driving Score (DS) | 76.1 DS | DriveMLM (2023) |
| nuScenes Test (3D Segmentation) | mIoU (mean Intersection over Union) | 81.14 mIoU | MSeg3D (2023) |
| KITTI-C (Corrupted) | mAP (mean Average Precision) | +6.51% mAP over TransFusion baseline | RoboFusion (2024) |
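For readers unfamiliar with the mIoU metric used in the segmentation row, the following sketch computes it as the mean of per-class intersection-over-union on flat label lists. This is a simplified illustration with made-up labels; benchmark toolkits typically accumulate intersections and unions over a whole dataset and may handle ignored classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six points with three classes (illustrative labels only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.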
Known Limitations (4)
- Sim-to-real domain gap: Models trained in simulators (CARLA, NAVSIM) or with synthetic corruptions may not transfer reliably to real-world driving conditions with novel sensor noise and lighting. (affects: Reinforcement-Learning-Enhanced VLA Driving, Driving World Models, End-to-End Trajectory Planning and Scoring)
Potential fix: IRL-VLA proposes Reward World Models that bypass sensor simulation entirely; domain randomization and progressive real-world fine-tuning are emerging strategies.
- Computational overhead of reasoning: Chain-of-thought and VLA reasoning add significant latency, which conflicts with the real-time requirements of autonomous driving at highway speeds. (affects: Chain-of-Thought Reasoning for Driving, Reinforcement-Learning-Enhanced VLA Driving)
Potential fix: AdaThinkDrive's adaptive fast/slow mechanism bypasses reasoning in 84% of simple scenarios; NoRD eliminates reasoning annotations entirely while maintaining competitive performance.
- Long-tail data scarcity: Rare but safety-critical scenarios (construction zones, emergency vehicles, unusual pedestrian behavior) remain severely underrepresented in training datasets. (affects: Reinforcement-Learning-Enhanced VLA Driving, End-to-End Trajectory Planning and Scoring, Multi-Modal Sensor Fusion for 3D Perception)
Potential fix: TOKEN uses object-level tokenization to leverage LLM reasoning for long-tail generalization; Alpamayo-R1's causal reasoning enables systematic handling of novel scenarios through compositional understanding.
- Benchmark-reality disconnect: Current benchmarks primarily evaluate in constrained settings and may not capture the full complexity of real-world social interactions and edge cases. (affects: Chain-of-Thought Reasoning for Driving, End-to-End Trajectory Planning and Scoring, Driving World Models)
Potential fix: Both surveys identify the need for benchmarks that test social-cognitive reasoning, adversarial interactions, and multi-agent negotiation beyond current structured evaluation protocols.
Major papers in this topic (10)
- Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail (2025-10) 9
- IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model for End-to-End Autonomous Driving (2025-08) 8
- Generalized Trajectory Scoring for End-to-end Multimodal Planning (2025-06) 8
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning (2026-02) 8
- VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning (2024-02) 8
- MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving (2023-03) 8
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving (2023-12) 8
- Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving (2025-03) 8
- Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving (2024-07) 8
- PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning (2024-06) 8
Within the same paradigm, another important research direction focuses on World Models and Simulation.
World Models and Simulation
What: World models learn to predict future environment states given actions, enabling embodied agents to simulate outcomes and plan without costly real-world trial-and-error.
Why: Embodied agents must anticipate consequences of actions to plan safely, especially in driving and manipulation where real-world mistakes are dangerous and irreversible.
Baseline: Imitation learning policies that directly map observations to actions without internal forward simulation or explicit dynamics reasoning.
- Pixel-level future prediction is computationally expensive and often produces physically implausible long-horizon forecasts
- Sim-to-real domain gaps cause world models trained in simulation to fail in real-world deployment
- Learned latent representations frequently lack geometric and kinematic structure needed for safe planning
Running Example
Baseline: An imitation learning policy replays left-turn trajectories from training data but cannot adapt to the specific timing of oncoming cars or pedestrian positions, risking a dangerous merge.
Challenge: The vehicle must predict how oncoming cars will decelerate or maintain speed, whether a pedestrian will enter the crosswalk, and evaluate multiple trajectory options, which requires forward simulation of a dynamic multi-agent scene over several seconds.
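The forward-simulation idea in this example can be sketched with a toy stand-in for a world model. Everything below is an assumption made for illustration: the "model" is a hand-written constant-velocity predictor rather than a learned network, the conflict-zone geometry and speeds are invented, and only one oncoming car is considered.

```python
def simulate(ego_speeds, oncoming_gap=40.0, oncoming_speed=10.0, dt=0.5):
    """Roll a toy constant-velocity model forward.

    Returns (collided, ego_progress). The ego occupies the conflict zone
    while its path position is in [10, 14] m; the oncoming car occupies it
    while its remaining gap to the zone is in [-4, 0] m.
    """
    ego_pos, gap = 0.0, oncoming_gap
    for v in ego_speeds:
        ego_pos += v * dt
        gap -= oncoming_speed * dt
        if 10.0 <= ego_pos <= 14.0 and -4.0 <= gap <= 0.0:
            return True, ego_pos
    return False, ego_pos


# Candidate ego speed profiles over an 8-second horizon (16 steps of 0.5 s).
candidates = {
    "go_now": [5.0] * 16,
    "creep": [2.5] * 16,
    "wait_then_go": [0.0] * 9 + [5.0] * 7,
}

results = {name: simulate(speeds) for name, speeds in candidates.items()}
safe = {name: prog for name, (hit, prog) in results.items() if not hit}
best = max(safe, key=safe.get)  # safe candidate with the most progress
```

Each candidate is evaluated in imagination before anything is executed: the slow "creep" profile is rejected because it meets the oncoming car in the conflict zone, and the planner picks the safe option that makes the most progress.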
Overall Progress
World models have evolved from pixel-level generative approaches to structured latent-space methods that leverage pre-trained foundation model features. A major paradigm shift occurred with the decomposition of monolithic next-frame prediction into explicit reasoning chains (flow, intent tokens). The field has also expanded from single-domain applications to specialized variants for driving, manipulation, anomaly detection, and planetary-scale environmental monitoring.
Sub-topics
Autonomous Driving World Models
4 papers
World models specifically designed for self-driving that forecast future road scenes (via occupancy grids, reward functions, or intent tokens) conditioned on ego-vehicle actions to enable safe trajectory planning.
Robotic Manipulation World Models
3 papers
World models for robot manipulation tasks that learn dynamics in latent spaces, using pre-trained visual features or motion decomposition, to enable zero-shot planning and failure detection.
Foundation World Model Frameworks and Surveys
3 papers
Conceptual frameworks and comprehensive surveys that define the theoretical underpinnings of world models for embodied AI, including causal reasoning requirements and VLA taxonomies.
Planetary-Scale World Models
1 paper
World models that extend to Earth-scale environments using 4D space-time encodings, enabling self-supervised multi-modal learning for environmental monitoring across vast spatial and temporal ranges.
💡 Key Insights
💡 Pre-trained visual features enable zero-shot world model planning without task-specific training.
💡 Explicit motion reasoning prevents pixel-copying and improves physical plausibility of predictions.
💡 Lightweight reward world models bypass expensive simulators for closed-loop RL policy optimization.
💡 Kinematics-grounded latent spaces dramatically reduce data requirements for driving policy learning.
💡 World models double as anomaly detectors with statistical safety guarantees for deployment.
Timeline
Research has progressed from conceptual frameworks and surveys (2024) through VLA-integrated cognitive architectures with explicit reasoning (2025) to domain-specialized, data-efficient models with safety guarantees for real-world deployment (2026).
- The causality framework (The Essential Role of Causality..., 2024) articulated why foundation models need causal reasoning for embodied AI, proposing Foundation Veridical World Models (FVWMs).
- A comprehensive VLA survey (A Survey on Vision-Language-Action Models..., 2024) organized Vision-Language-Action models into a hierarchical taxonomy spanning components, control, and planning.
- Drive-OccWorld (Driving in the Occupancy World, 2024) demonstrated that 4D occupancy forecasting conditioned on ego-actions improves planning safety, gaining +9.5% mIoU on nuScenes.
- DINO-WM (2024) showed that building world models on frozen DINOv2 features enables zero-shot planning, improving success rate by 45% over IRIS.
Shift from pixel-level generative world models to structured latent-space and pre-trained-feature-based approaches that prioritize planning utility over visual fidelity.
- Meta's Embodied AI Agents (2025) proposed unifying mental and physical world models under JEPA-based architectures, releasing the 4,000-hour Seamless Interaction dataset.
- IRL-VLA (2025) introduced Reward World Models via inverse RL, achieving 1st runner-up at the CVPR 2025 Autonomous Grand Challenge with 45.0 EDPMS on NAVSIM v2.
- FlowVLA (2025) introduced Visual Chain of Thought that predicts optical flow before appearance, achieving state-of-the-art on CALVIN manipulation benchmarks.
- Constructing the Umwelt (2025) replaced dense reconstruction with sparse Intent Tokens via Belief-Intent Co-Evolution for cognitively-inspired planning.
Emergence of explicit reasoning steps (optical flow, intent tokens) within world models, moving beyond monolithic next-frame prediction to decomposed prediction pipelines.
- Self-Supervised (2026) scaled world models to planetary scale via 4D hash encoding, achieving 99.3% parameter reduction over Galileo while maintaining accuracy.
- Foundational failure detection (Foundational World Models Accurately Detect..., 2026) applied pre-trained latent-space world models as anomaly detectors for bimanual robots with conformal prediction safety guarantees.
- Kinematics-Aware (2026) grounded latent dynamics in explicit vehicle kinematics and spatial structure, improving mean return by 23.1% while reaching stable performance in 80k steps versus 300k+ for PPO.
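The conformal prediction guarantee mentioned for failure detection can be illustrated with split conformal calibration: anomaly scores from known-good runs set a threshold, and under exchangeability a new nominal run exceeds it with probability at most `alpha`. This is a generic sketch, not the cited paper's pipeline. The function names are hypothetical, and the scores are assumed to be something like world-model prediction errors.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the split-conformal quantile of anomaly scores from nominal runs."""
    n = len(calibration_scores)
    # Rank of the finite-sample-corrected (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: never flag.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]

def is_anomalous(score, threshold):
    """Flag a run whose anomaly score exceeds the calibrated threshold."""
    return score > threshold
```

The `(n + 1)` correction is what turns an empirical quantile into a finite-sample bound on the false-alarm rate; it is also why very small calibration sets cannot support a small `alpha`.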
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Pre-trained Feature Latent Dynamics | Use pre-trained patch-level visual features as the state space and train a transformer to predict future features conditioned on actions. | Improves on IRIS by +45% average success rate on the hardest navigation and manipulation tasks, achieving 56% better visual reconstruction fidelity (LPIPS). | DINO-WM (2024), Foundational World Models Accurately Detect... (2026) |
| Action-Conditioned Occupancy Forecasting | Forecast structured 3D space occupancy under different action hypotheses rather than generating raw video frames. | Drive-OccWorld improves on prior occupancy methods by +9.5% mIoU on nuScenes, achieving 38.2% mIoU; Kinematics-Aware model improves +23.1% Mean Return over image-only baselines. | Driving in the Occupancy World:... (2024), Kinematics-Aware (2026) |
| Inverse RL Reward World Models | Learn a differentiable Reward World Model from expert demonstrations that scores trajectories for safety and compliance without sensor simulation. | Achieves 45.0 EDPMS on NAVSIM v2, securing 1st runner-up at CVPR 2025 Autonomous Grand Challenge over prior open-loop VLA baselines. | IRL-VLA (2025) |
| Explicit Motion-Reasoning World Models | Insert an explicit motion-reasoning intermediate step between current observation and future state prediction to enforce physical plausibility. | FlowVLA achieves state-of-the-art on CALVIN robot manipulation benchmarks with substantially improved sample efficiency over UniVLA and WorldVLA baselines. | FlowVLA (2025), Constructing the Umwelt (2025) |
| Planetary-Scale 4D Space-Time World Models | Concatenate features from spatial and spatio-temporal hash grids with learned collision resolution for efficient 4D Earth-scale indexing. | Improves on standard hash encoding by +35.0% R² (0.783 vs 0.58) on Live Fuel Moisture prediction; achieves 99.3% parameter reduction (5M vs 800M) over the Galileo foundation model. | Self-Supervised (2026) |
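
The +35.0% figure in the last row of the table is a relative gain rather than an absolute difference in the coefficient of determination. A quick check using only the two values reported in that row:

```latex
\frac{0.783 - 0.58}{0.58} = \frac{0.203}{0.58} \approx 0.35 \quad (+35\%\ \text{relative}),
\qquad
0.783 - 0.58 = 0.203 \quad (\text{absolute gain in } R^2)
```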
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| NAVSIM v2 | EDPMS (Extended Predictive Driver Model Score, usually abbreviated EPDMS) | 45.0 EDPMS | IRL-VLA (2025) |
| nuScenes Occupancy Forecasting | mIoU (mean Intersection over Union) | +9.5% mIoU over prior state-of-the-art | Driving in the Occupancy World:... (2024) |
| CALVIN Robot Manipulation | Task Success Rate | State-of-the-art (specific value not reported) | FlowVLA (2025) |
| Zero-shot Navigation and Manipulation (MiniGrid, DM Control) | Success Rate | +45% average success rate over IRIS on hardest tasks | DINO-WM (2024) |
| Live Fuel Moisture Content Prediction | R² (coefficient of determination) | 0.783 R² | Self-Supervised (2026) |
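
For readers unfamiliar with the mIoU metric in the occupancy row, here is a minimal sketch of how mean Intersection over Union is commonly computed over class labels. It assumes flattened per-voxel class IDs and averages only over classes that occur. Benchmark-specific conventions (ignored classes, per-frame versus global averaging) differ and are not reproduced here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        # Voxels where both agree on class c, and voxels where either says c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1.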
⚠️ Known Limitations (4)
- Long-horizon prediction degradation: world model accuracy deteriorates significantly over extended prediction horizons, making multi-second planning unreliable for safety-critical applications. (affects: Pre-trained Feature Latent Dynamics, Action-Conditioned Occupancy Forecasting, Explicit Motion-Reasoning World Models)
Potential fix: Hierarchical prediction at multiple temporal resolutions, or cognitive approaches like TIWM that reason about sparse intents rather than dense pixel-level futures.
- Sim-to-real transfer gap: world models trained in simulation or on offline data may not faithfully represent real-world physics, leading to planning failures during deployment. (affects: Action-Conditioned Occupancy Forecasting, Inverse RL Reward World Models)
Potential fix: Foundation Veridical World Models with causal reasoning as proposed in the causality framework, or grounding latent spaces in explicit kinematics to enforce physical consistency.
- Lack of unified evaluation: no standardized benchmark exists across driving, manipulation, and other domains, making it difficult to compare world model approaches and track overall field progress. (affects: Pre-trained Feature Latent Dynamics, Action-Conditioned Occupancy Forecasting, Explicit Motion-Reasoning World Models, Planetary-Scale 4D Space-Time World Models)
Potential fix: Establishing cross-domain benchmark suites that test both prediction fidelity and downstream planning performance, as advocated by VLA surveys.
- Computational overhead: training and running world models adds significant cost on top of base policies, particularly for methods requiring high-resolution 3D occupancy prediction or multi-modal fusion. (affects: Action-Conditioned Occupancy Forecasting, Planetary-Scale 4D Space-Time World Models)
Potential fix: Parameter-efficient approaches like DeepEarth's 4D hash encoding (99.3% parameter reduction) or compact latent-space methods like DINO-WM and the Cosmos-based failure detector (1/20th parameters).
Major papers in this topic (11)
- IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model for End-to-End Autonomous Driving (2025-08) 8
- FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models (2025-08) 8
- DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (2024-11) 8
- Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding (2026-03) 8
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving (2024-08) 7
- The Essential Role of Causality in Foundation World Models for Embodied AI (2024-02) 7
- A Survey on Vision-Language-Action Models for Embodied AI (2024-05) 7
- Embodied AI Agents: Modeling the World (2025-06) 7
- Constructing the Umwelt: Cognitive Planning through Belief-Intent Co-Evolution (2025-10) 7
- Foundational World Models Accurately Detect Bimanual Manipulator Failures (2026-03) 7
- Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving (2026-03) 7
💡 Moving to the next paradigm, we turn to Multimodal Generation.
Multimodal Generation
What: Research on generating content across multiple modalities (images, 3D, video, speech) using unified generative frameworks including diffusion models, flow matching, and reinforcement-learning-enhanced optimization.
Why: Enabling machines to create high-quality, controllable content across modalities is essential for creative tools, robotics, scientific discovery, and human-AI interaction.
Baseline: Standard generative models (GANs, vanilla diffusion) produce content in single modalities with limited controllability and no cross-modal consistency guarantees.
- Sparse reward signals in RL-based generation fail to credit individual denoising steps appropriately
- Maintaining geometric consistency and fine-grained details across 3D views and multimodal outputs
- Balancing identity fidelity with editability and safety in personalized generation tasks
🧪 Running Example
Baseline: A standard diffusion model generates plausible 2D images but produces inconsistent geometry across views, lacks realistic view-dependent reflections, and cannot incorporate human preference feedback to iteratively improve quality.
Challenge: This example requires 3D geometric consistency (multi-view coherence), view-dependent appearance modeling (metallic reflections), and fine-grained reward decomposition to identify which denoising steps contribute to texture quality versus geometric accuracy.
Overall Progress
The field has evolved from isolated single-modality generation toward unified frameworks that handle multiple tasks (generation, optimization, planning) within a single model. The most significant paradigm shift has been the adoption of reinforcement learning, particularly GRPO variants, as a universal fine-tuning strategy across 2D, 3D, and embodied generation domains, with increasingly sophisticated reward decomposition. Simultaneously, theoretical work has matured, providing rigorous mathematical foundations (Wasserstein gradient flows, topological analysis) for emerging generative approaches.
Sub-topics
RL-Optimized Multimodal Generation
5 papers
Applying reinforcement learning, particularly Group Relative Policy Optimization (GRPO) variants, to improve generative model outputs across image, 3D, and robotic domains by optimizing reward signals during the denoising process.
Diffusion-Based 3D & Scene Generation
4 papers
Using diffusion models and flow matching to generate 3D scenes, motion plans, and structural ensembles with physics-based constraints and multi-view consistency.
Multimodal Understanding, Reasoning & Embeddings
4 papers
Methods for jointly reasoning across modalities, producing unified embeddings, modeling inter/intra-modality dependencies, and auditing black-box vision systems through semantic approaches.
Personalized & Empathetic Generation
2 papers
Generating identity-preserving portraits and emotionally responsive multimodal content (text, voice, avatar) that maintains consistency across attributes and modalities.
Generative AI Foundations & Surveys
4 papers
Theoretical frameworks for generative modeling (gradient flows, topological analysis) and comprehensive surveys mapping the AIGC landscape from GANs to multimodal LLMs.
Domain-Specific & Applied Generation
6 papers
Application of multimodal generative models to specialized domains including molecular design, nanophotonic fabrication, medical imaging, and federated learning for tactile internet.
Human-AI Co-Creation & Interaction
5 papers
Studies on how humans collaborate with generative AI tools for creative design, exploring prompt strategies, trust dynamics, context-aware generation workflows, and frameworks for collaborative ideation.
💡 Key Insights
💡 RL fine-tuning via GRPO variants improves generation quality across 2D, 3D, and robotics
💡 Step-wise reward decomposition significantly outperforms sparse terminal rewards for denoising
💡 Small RL-tuned models (2B parameters) can surpass large proprietary models on specialized tasks
💡 Unified diffusion handles generation, optimization, and planning within one framework
💡 Reasoning before embedding boosts multimodal representation quality by over 10%
Timeline
Research has progressed from foundational diffusion architectures and surveys (2023) through specialized multi-modal and personalized generation (2024) to RL-optimized generation with principled step-wise reward design (2025–2026), with a clear trend toward smaller models achieving parity with large proprietary systems through targeted RL training.
- SceneDiffuser (Diffusion-based Generation, Optimization, and Planning..., 2023) introduced unified diffusion for joint 3D scene generation, physics optimization, and planning, achieving 49.35% physical plausibility vs 14.64% for cVAE baselines
- Image Hijacks (2023) revealed critical vulnerabilities in VLMs through adversarial image optimization, achieving 100% attack success rate
- The AIGC Survey (A Comprehensive Survey of AI-Generated Content, 2023) mapped the generative AI landscape from GANs to ChatGPT, identifying the Transformer as the convergence point for vision and language
- Full-Atom (2024) pioneered multi-modal Riemannian flow matching across four geometric manifolds (R3, SO(3), Hypertorus, Simplex) for molecular design
- GUMP (Solving Motion Planning Tasks with..., 2024) demonstrated a single generative world model serving simultaneously as simulator, planner, and RL environment for autonomous driving
- UniPortrait (2024) solved multi-identity image personalization with plug-and-play ID embedding decoupling and spatial routing, outperforming InstantID and FastComposer
- I2M2 (Jointly Modeling Inter- & Intra-Modality Dependencies, 2024) introduced a Product of Experts approach to dynamically leverage inter- and intra-modality dependencies for multi-modal learning
- Hi-GRPO (Are We Ready for RL..., 2025) conducted the first systematic study of RL for 3D generation with hierarchical reward decomposition, achieving 28.5 CLIP Score on MME-3DR
- Syn-GRPO (2025) solved entropy collapse via asynchronous on-the-fly data synthesis with diversity rewards, gaining +3.4% accuracy over Visual-RFT on RefCOCOg
- TP-GRPO (Alleviating Sparse Rewards in Flow-Based GRPO, 2026) replaced sparse terminal rewards with incremental per-step credit assignment and turning-point detection for flow-based generation
- Think-Then-Embed (2025) bridged generative reasoning and embedding quality by introducing intermediate reasoning traces, achieving 71.5% state-of-the-art on MMEB-V2
- Gradient Flow Drifting (2026) unified the theoretical foundations of drifting generative models through Wasserstein gradient flow equivalence, enabling principled divergence mixing
- LiTo (Surface Light Field Tokenization, 2026) introduced the first latent 3D representation jointly modeling geometry and view-dependent appearance with spherical harmonics
Reinforcement learning, especially GRPO variants, became the dominant paradigm for improving generative outputs across 2D images, 3D assets, and robotic policies, replacing purely supervised or GAN-based optimization with reward-driven trajectory comparison.
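The reward-driven trajectory comparison at the heart of GRPO can be sketched as a group-relative advantage: several samples are generated for the same prompt, and each reward is standardized against the group's mean and standard deviation, so no learned value function is needed. The function name and the `eps` stabilizer are illustrative choices, and the variants listed above may normalize or weight rewards differently.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # Samples above the group mean get positive advantages, those below negative.
    return [(r - mean) / (std + eps) for r in rewards]
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient. That is one reason reward diversity within a group matters.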
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Group Relative Policy Optimization for Generation | Rank groups of generation trajectories by reward signals and optimize the policy to favor higher-ranked outputs, with variants addressing step-wise credit assignment, hard-case mining, and data diversity. | Hi-GRPO improves on base ShapeLLM-Omni by +8.7 CLIP Score on MME-3DR, achieving 28.5 vs 19.8; Syn-GRPO improves on Visual-RFT by +3.4% accuracy on RefCOCOg; HCM-GRPO with a 2B model surpasses GPT-4o by +20 points on aesthetic reasoning. | Alleviating Sparse Rewards by Modeling... (2026), Image Aesthetic Reasoning via HCM-GRPO:... (2025), Syn-GRPO (2025), Are We Ready for RL... (2025), Reinforcement Learning for Flow-Matching Policies (2025) |
| Unified Diffusion for 3D Scene Understanding | Inject physics constraints (collision, contact) and goals (target location) as differentiable gradients during each denoising step, replacing separate planners and optimizers with one sampling loop. | SceneDiffuser achieves 49.35% physically plausible human poses vs 14.64% for cVAE baselines (+34.7 pp); attains 71.27% grasp success where cVAE+optimization fails completely (0.00%). | Diffusion-based Generation, Optimization, and Planning... (2023), Solving Motion Planning Tasks with... (2024) |
| Identity-Preserving Personalized Generation | Decouple identity into intrinsic features and spatial structure branches, with dynamic ID routing that assigns the best-matching identity to each spatial location during generation. | UniPortrait achieves higher identity similarity (CS-I) and prompt consistency (CLIP-T) than InstantID and IP-Adapter-FaceID-Plus on single-ID benchmarks; outperforms FastComposer on multi-ID customization. | UniPortrait (2024), E3RG (2025) |
| Think-Then-Embed Multimodal Reasoning | Generate an Embedding-Centric Reasoning (ECR) trace before creating the embedding, conditioning the final representation on both the original input and the intermediate reasoning. | TTEt-7B achieves 71.5% on MMEB-V2, surpassing proprietary models like seed-1.6-embedding; TTEs-7B outperforms VLM2Vec-V2 by +7.4% on MMEB-V1, achieving state-of-the-art; TTEt-2B improves over VLM2Vec-V2 2B by +10.6% on MMEB-V2. | Think-Then-Embed (2025) |
| Gradient Flow Drifting Framework | The drifting field in generative drifting models equals the Wasserstein-2 gradient flow velocity for KDE-smoothed KL divergence, generalizable to any f-divergence or MMD. | Provides the first rigorous theoretical foundation for Drifting Models, which previously relied on heuristic analysis; generalizes to arbitrary f-divergences (Reverse KL, Chi-squared) and principled mixing of mode-seeking and mode-covering flows. | Gradient Flow Drifting (2026) |
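
The idea in the Unified Diffusion row, injecting goal and collision objectives as gradients inside every denoising step, can be illustrated with a toy 2-D sampler. Everything here is a hypothetical stand-in rather than SceneDiffuser's implementation: `denoise_step` fakes a learned denoiser by shrinking toward the origin, and the two penalty gradients are simple hand-written quadratics.

```python
import random

def denoise_step(x, t, rng, n_steps):
    """Toy stand-in for a learned denoiser: shrink toward the origin, add decaying noise."""
    noise_scale = 0.1 * t / n_steps
    return [0.9 * xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]

def goal_grad(x, goal):
    """Gradient of 0.5 * ||x - goal||^2."""
    return [xi - gi for xi, gi in zip(x, goal)]

def obstacle_grad(x, center, radius):
    """Gradient of 0.5 * max(0, radius - dist)^2, a soft collision penalty."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    dist = sum(d * d for d in diff) ** 0.5
    if dist >= radius or dist == 0.0:
        return [0.0 for _ in x]
    return [-(radius - dist) * d / dist for d in diff]

def guided_sample(goal, center, radius, n_steps=50, scale=0.3, seed=0):
    """Run the sampling loop, nudging each denoised sample down the objective gradient."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    for t in range(n_steps, 0, -1):
        x = denoise_step(x, t, rng, n_steps)
        g = [a + b for a, b in zip(goal_grad(x, goal), obstacle_grad(x, center, radius))]
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x
```

Setting `scale=0.0` recovers the unguided sampler, which makes the effect of guidance easy to compare: the guided sample ends up between the toy prior's mode and the goal, reflecting the trade-off between the prior and the objective.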
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MME-3DR (3D Generation Quality) | CLIP Score | 28.5 CLIP Score | Are We Ready for RL... (2025) |
| MMEB-V2 (Massive Multimodal Embedding Benchmark) | Average Score (%) | 71.5% | Think-Then-Embed (2025) |
| 3D Scene Physical Plausibility | Physical Plausibility Rate (%) | 49.35% | Diffusion-based Generation, Optimization, and Planning... (2023) |
| RefCOCOg (Referring Expression Comprehension) | Accuracy (%) | +3.4% over Visual-RFT baseline | Syn-GRPO (2025) |
| Image Aesthetic Reasoning Benchmark | Accuracy Score | 64.74 | Image Aesthetic Reasoning via HCM-GRPO:... (2025) |
⚠️ Known Limitations (4)
- Reward design sensitivity: RL-based generation methods are highly sensitive to reward model choice and design; poor rewards lead to mode collapse or reward hacking rather than genuine quality improvement, especially for 3D tasks with higher spatial complexity. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Unified Diffusion for 3D Scene Understanding)
Potential fix: Hierarchical reward decomposition (Hi-GRPO) and incremental per-step rewards (TP-GRPO) partially address this by providing more granular, less noisy feedback signals; ensemble reward models combining human preference, aesthetic, and LMM-based evaluators further improve robustness.
- Computational cost and scalability: RL fine-tuning requires generating multiple complete trajectories per optimization step, multiplying training cost; diffusion-based 3D methods require many iterative denoising steps per sample, limiting real-time applicability. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Unified Diffusion for 3D Scene Understanding)
Potential fix: Partial-autoregressive decoding (GUMP) and variable-horizon generation reduce inference cost by 50-85%; asynchronous data synthesis (Syn-GRPO) improves training efficiency by generating diverse samples on-the-fly.
- Evaluation gaps: Current benchmarks typically measure single aspects (e.g., CLIP alignment, FID) while ignoring perceptual quality, physical plausibility, or user preference holistically, making comprehensive quality assessment of multimodal generation difficult. (affects: Group Relative Policy Optimization for Generation (GRPO Family), Identity-Preserving Personalized Generation)
Potential fix: HCM-GRPO proposes dedicated aesthetic reasoning benchmarks with 128k samples; combining multiple reward models (human preference, aesthetic, LMM-based) provides more holistic multi-dimensional evaluation.
- Adversarial vulnerability: Generative models with continuous image input channels are susceptible to adversarial manipulation, where optimized images can completely override intended model behavior with near-perfect success rates, and existing safety mechanisms provide no defense. (affects: Identity-Preserving Personalized Generation, Think-Then-Embed Multimodal Reasoning)
Potential fix: Current text-based safety training is ineffective against image-channel attacks; robust adversarial training against image perturbations and input validation pipelines are needed but remain an open research problem.
Major papers in this topic (10)
- Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation (2025-12) 8
- Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning (2025-11) 8
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes (2023-01) 8
- Solving Motion Planning Tasks with a Scalable Generative Model (2024-07) 8
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization (2024-08) 8
- Think-Then-Embed: Transforming MLLMs into Personalized Multimodal Embedding Models (2025-12) 8
- Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences (2026-03) 8
- Image Hijacks: Adversarial Images can Control Generative Models at Runtime (2023-09) 8
- LiTo: Surface Light Field Tokenization (2026-03) 8
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT (2023-12) 8
💡 Diving deeper into Multimodal Generation, let's examine specific research threads that define this area.
Text-to-Image Generation
What: Text-to-Image Generation synthesizes high-fidelity images from natural language descriptions using diffusion models, autoregressive transformers, and flow-matching architectures.
Why: Enabling anyone to create photorealistic or artistic images from text unlocks creative workflows across design, entertainment, education, and scientific visualization.
Baseline: Standard text-to-image diffusion models generate images by iteratively denoising random noise conditioned on CLIP-encoded text embeddings via a U-Net backbone.
- Aligning generated images with complex compositional prompts involving multiple objects, attributes, and spatial relationships
- Preserving specific subject identity while maintaining text editability and generation diversity
- Achieving high-quality generation efficiently with reduced inference steps and computational cost
🧪 Running Example
Baseline: A standard diffusion model generates a generic golden retriever (not the user's specific dog), may omit or misplace the sunglasses, and struggle with the spatial relationship between the dog, skateboard, and street scene.
Challenge: This prompt requires compositional reasoning (multiple objects with specific attributes), identity preservation (user's specific dog), and spatial understanding (dog on skateboard in a street scene). It also requires generating the image efficiently for interactive use.
Overall Progress
Text-to-image generation has progressed from basic supervised diffusion models to a mature ecosystem encompassing RL-aligned generation, explicit reasoning pipelines, and efficient one-step synthesis. The field witnessed three paradigm shifts: from supervised to RL-based alignment (2023), from generic to identity-preserving personalization (2024), and from direct text-to-image mapping to reasoning-guided generation (2025–2026). The GRPO framework emerged as the dominant alignment paradigm, spawning over 30 specialized variants in 2025 alone.
Sub-topics
RL-Based Preference Alignment
55 papers
Methods that apply reinforcement learning, particularly Group Relative Policy Optimization (GRPO) and its variants, to align diffusion and flow-matching models with human preferences using reward signals.
Reward Modeling & Direct Preference Optimization
35 papers
Reward models for evaluating generated images and DPO-based methods that bypass explicit reward functions by learning directly from preference pairs to fine-tune diffusion models.
Reasoning-Enhanced Generation
20 papers
Methods that inject explicit chain-of-thought reasoning, visual planning, or code-based planning before or during the image generation process to improve compositional accuracy.
Personalized & Subject-Driven Generation
60 papers
Techniques for customizing text-to-image models to generate images of specific user-provided subjects (faces, objects, styles) while maintaining text-based editability and multi-subject composition.
Efficient Inference & Model Compression
25 papers
Post-training quantization, one-step distillation, token merging, and architectural efficiency methods that enable fast deployment of large-scale diffusion and transformer-based generators.
Unified Multi-Modal Architectures
25 papers
Models that unify text understanding, image generation, and other modalities (audio, video) within a single architecture, enabling interleaved generation and any-to-any transformation.
Safety, Robustness & Evaluation
18 papers
Concept erasure, adversarial robustness, watermarking, and evaluation benchmarks that ensure generated content is safe, attributable, and faithfully measured.
💡 Key Insights
💡 GRPO-based RL has become the dominant alignment paradigm with 30+ variants in 2025 alone
💡 Chain-of-thought reasoning before generation boosts compositional accuracy by 13–68%
💡 One-step models now outperform multi-step giants when combined with score-based alignment
💡 Identity personalization shifted from minutes of fine-tuning to seconds of zero-shot encoding
💡 Early denoising steps determine semantic diversity while late steps control fine-grained detail
Timeline
Research has converged on three frontiers: (1) increasingly sophisticated RL alignment methods that exploit temporal structure in diffusion, (2) chain-of-thought reasoning that decomposes complex prompts before generation, and (3) unified architectures that merge understanding and generation in a single model.
- DPOK (2023) established online RL for diffusion with KL-regularized policy gradients
- D3PO (Using Human Feedback to Fine-tune..., 2023) applied DPO directly to diffusion's multi-step MDP, eliminating separate reward models
- DRaFT (Directly Fine-Tuning Diffusion Models on..., 2023) pioneered backpropagation through the full sampling chain, >200x faster than RL
- CM3Leon (Scaling Autoregressive Multi-Modal Models, 2023) achieved zero-shot FID 4.88, proving autoregressive models can rival diffusion with 5x less compute
- PTQD (Accurate Post-Training Quantization for Diffusion Models, 2023) introduced quantization noise correction that absorbs error into diffusion variance
The shift from supervised fine-tuning to reinforcement learning for diffusion model alignment, with DPOK and D3PO establishing the multi-step MDP framework.
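The preference-pair objective that D3PO carries over to diffusion is the standard DPO loss, sketched here on scalar log-probabilities. This is an illustration, not D3PO's implementation. In a diffusion setting the log-probabilities would come from per-step denoising transitions, and the argument names and the `beta` default are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: w is the preferred sample, l the rejected one."""
    # Implicit reward margin: how much more the policy favors w over l
    # than the frozen reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No separate reward model appears anywhere in the loss: the reference-relative log-ratio plays that role, which is the sense in which preference optimization bypasses explicit reward functions.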
- InstantID (2024) enabled plug-and-play face personalization using face recognition embeddings
- Diff-Instruct* (David and Goliath, 2024) showed a 2.6B one-step model can outperform 12B FLUX-dev using score-based divergence RLHF
- PTQ4DiT (Post-training Quantization for Diffusion Transformers, 2024) achieved the first effective 4-bit weight quantization for DiT architectures
- Large-Scale RL (Large-scale Reinforcement Learning for Diffusion Models, 2024) scaled RL to millions of prompts with distribution-based fairness rewards
- MoT (2024) matched dense baseline performance using only 55.8% of training FLOPs via modality-specific parameter routing
- Flow-GRPO (2025) introduced ODE-to-SDE conversion enabling online RL for flow models, boosting GenEval from 63% to 95%
- DanceGRPO (2025) unified GRPO for diffusion and rectified flow, scaling stably to 10,000+ prompts
- T2I-R1 (2025) introduced bi-level chain-of-thought (semantic + token level) for reasoning-enhanced generation
- APT (2025) enabled one-step 1280×720 video generation by training against real data rather than distilling from a teacher
- Seedream 4.0 (2025) unified T2I, editing, and multi-image composition, ranking #1 on Artificial Analysis Arena
- EndoCoT (2026) scaled chain-of-thought to diffusion transformers, achieving 92.1% on complex reasoning benchmarks
The explosion of GRPO variants transformed visual generation alignment from unstable RL into a principled, scalable framework, while chain-of-thought reasoning bridged the gap between language understanding and visual synthesis.
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Group Relative Policy Optimization for Visual Generation | Generate multiple images per prompt, compute relative rewards within the group, and update the policy to favor high-reward trajectories over low-reward ones. | Improves on DDPO/DPOK by scaling to 10,000+ prompts stably. Flow-GRPO boosts SD3.5-M GenEval from 63% to 95%, and DanceGRPO achieves +181% on VideoAlign motion quality. | Flow-GRPO (2025), DanceGRPO (2025), DiffusionNFT (2025), BranchGRPO (2025), TempFlow-GRPO (2025) |
| Direct Reward Fine-Tuning & Preference Optimization | Treat the sampling chain as a differentiable computation graph and propagate reward signals directly to model parameters, or use preference pairs to implicitly learn optimal rewards. | Improves on standard RL (DDPO) by >200x faster convergence. Diff-Instruct* (2.6B, 1-step) outperforms FLUX-dev (12B, 50-step) on ImageReward and PickScore. | Diff-Instruct*: Small One-step Model Beats... (2024), DRaFT (2023), D3PO (2023), TDM-R1 (2026) |
| Chain-of-Thought Reasoning for Image Generation | Decompose image generation into a reasoning phase (producing plans, layouts, or code scaffolds) and a synthesis phase, optimized jointly via reinforcement learning. | Improves on direct text-to-image generation by +13% on T2I-CompBench (T2I-R1 vs. Janus-Pro) and +68.83% on StructT2IBench (CoCo vs. Bagel baseline). | T2I-R1 (2025), GoT (2025), CoCo (2026), EndoCoT (2026) |
| Training-Free Identity-Preserving Personalization | Decouple identity encoding from text conditioning using specialized face or subject encoders injected via parallel attention, ControlNet-like branches, or token replacement. | Improves on DreamBooth by eliminating fine-tuning (100x speedup from minutes to seconds) while achieving competitive or superior identity fidelity. InstantID matches LoRA methods using a single reference image. | InstantID (2024), Personalize Anything for Free with... (2025), JeDi (2024), InfiniteYou (2025) |
| Post-Training Quantization & One-Step Distillation | Adapt quantization parameters to the temporal dynamics of diffusion (progressive calibration, distribution-aware grouping) or replace iterative denoising with one-step adversarial generation. | PTQ4DiT achieves near-lossless W8A8 on DiT-XL where baselines degrade to 58.74 FID. Adversarial Post-Training enables one-step 1280×720 video at 24fps on a single H100. | PTQ4DiT (2024), Diffusion Adversarial Post-Training for One-Step... (2025), PTQD (2023), PCR (2023) |
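The group-relative reward step described in the first row of the table above can be sketched in a few lines. This is a minimal illustration, not the exact formulation of any listed paper: the function name `group_relative_advantages`, the `eps` stabilizer, and the example scores are assumptions made for the sketch. It shows only how rewards for several samples from one prompt are normalized against each other, so no separate value network is needed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of generated samples.

    Samples scoring above the group mean get a positive advantage,
    samples below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four images sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

In a full training loop these advantages would weight the policy-gradient update for each sample's denoising trajectory; that part is omitted here.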
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| GenEval | Overall Accuracy (%) | 98% (GenEval score 0.98) | DiffusionNFT (2025) |
| T2I-CompBench | Average Compositional Score | +13% over Janus-Pro baseline | T2I-R1 (2025) |
| HPSv2.1 (Human Preference Score v2.1) | HPSv2.1 Score | 31.19 HPSv2.1 | David and Goliath (2024) |
| PickScore | Win Rate / PickScore | 67.48% win-rate on Pick-a-Pic v1 test set | LPO (2025) |
Known Limitations (4)
- Reward hacking and mode collapse: Models optimized with RL or DPO tend to exploit imperfect reward models, generating high-scoring but low-diversity or unrealistic images that overfit to the proxy objective. (affects: GRPO-Based Visual Alignment, Direct Reward Fine-Tuning & Preference Optimization)
  Potential fix: Pairwise preference rewards (Pref-GRPO), distribution-aware reward bonuses for rare clusters (DiverseGRPO), annealed importance guidance (AIG), and self-entropy regularization (SEE-DPO) all address this by explicitly encouraging diversity.
- Identity-editability tradeoff in personalization: Methods that strongly preserve subject identity often lose the ability to follow complex text prompts, while editable methods sacrifice identity fidelity. (affects: Training-Free Identity-Preserving Personalization)
  Potential fix: Decoupling identity from text via separate attention branches (Infinite-ID), parallel attention architectures (Imagine yourself), and synthetic data curricula that balance identity and editability during training.
- Sparse credit assignment across denoising timesteps: Most RL methods assign a single terminal reward uniformly to all timesteps, ignoring that early steps determine structure while late steps refine details. (affects: GRPO-Based Visual Alignment)
  Potential fix: Trajectory branching at specific timesteps (TempFlow-GRPO, BranchGRPO, TreeGRPO), tree-structured rollouts with depth-wise advantage estimation, and chunk-level optimization that groups timesteps by temporal dynamics.
- Computational cost of RL training for visual models: Full trajectory sampling, large group sizes, and multi-step backpropagation make RL fine-tuning prohibitively expensive, limiting accessibility. (affects: GRPO-Based Visual Alignment, Direct Reward Fine-Tuning & Preference Optimization)
  Potential fix: Prefix reuse in tree-structured rollouts (TreeGRPO, 2.4x speedup), early trajectory pruning via ODE preview (Pro-GRPO), and deterministic ODE-based training that avoids SDE overhead (Neighbor GRPO, 12x fewer forward-backward calculations).
Major papers in this topic (10)
- Flow-GRPO: Training Flow Matching Models via Online RL (2025-05) 9
- DanceGRPO: Unleashing GRPO on Visual Generation (2025-05) 9
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process (2025-09) 9
- Diff-Instruct*: Small One-step Model Beats Large Diffusion with Score Post-training (2024-10) 9
- Diffusion Adversarial Post-Training for One-Step Video Generation (2025-01) 9
- Seedream 4.0: Toward Next-generation Multimodal Image Generation (2025-09) 9
- InstantID: Zero-shot Identity-Preserving Generation in Seconds (2024-01) 9
- CM3Leon: Scaling Autoregressive Multi-Modal Models (2023-09) 9
- RewardDance: Reward Scaling in Visual Generation (2025-09) 9
- TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward (2026-03) 9
Within the same paradigm, another important research direction focuses on Text-to-Video Generation.
Text-to-Video Generation
What: Research on generating temporally coherent, high-fidelity video sequences from textual descriptions using diffusion models, autoregressive transformers, and flow matching architectures.
Why: Enabling automated video creation democratizes content production for entertainment, education, robotics simulation, and embodied AI planning.
Baseline: Standard text-to-video diffusion models iteratively denoise latent representations conditioned on text embeddings, requiring 50+ sampling steps with limited motion control.
- Maintaining temporal coherence and motion consistency across frames while scaling to longer durations
- Aligning generated videos with human aesthetic preferences and physical plausibility beyond training data
- Reducing computational cost of iterative diffusion sampling for real-time or interactive applications
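To make the sampling-cost challenge above concrete, here is a toy sketch of an iterative sampler, assuming a flow-style formulation where sampling integrates an ODE with Euler steps. Everything in it is hypothetical: `euler_sample`, `toy_velocity`, and `TARGET` stand in for a real video model, whose velocity or noise prediction would be a large network call. The point is only that the number of model evaluations equals the number of sampling steps.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    Returns the final sample and the number of velocity evaluations,
    which is the dominant cost when velocity is a neural network.
    """
    x, calls = x0, 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + velocity(x, t) * dt
        calls += 1
    return x, calls


TARGET = 2.0  # hypothetical "clean" sample the toy flow moves toward


def toy_velocity(x, t):
    # Exact velocity of a straight-line path that reaches TARGET at t=1.
    return (TARGET - x) / (1.0 - t)


x50, calls50 = euler_sample(toy_velocity, x0=-1.0, num_steps=50)
x1, calls1 = euler_sample(toy_velocity, x0=-1.0, num_steps=1)
```

In this toy case a single step lands on the target because the path is perfectly straight. Real models do not have this property out of the box, which is why the few-step methods surveyed in this section rely on distillation or adversarial training.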
Running Example
Baseline: A standard 50-step diffusion model produces a blurry dog-like shape drifting across static flowers with flickering artifacts, inconsistent dog appearance between frames, and the 'sniffing' action entirely ignored: the dog teleports from running to standing
Challenge: This example requires temporal coherence (consistent dog appearance), action understanding (running → stopping → sniffing → looking up), physical plausibility (natural deceleration), and must be generated fast enough for iterative creative workflows
Overall Progress
Text-to-video generation has evolved from basic text-conditioned diffusion requiring 50+ slow sampling steps to real-time, one-step generation with human-preference alignment. The field underwent two major paradigm shifts: first from supervised training to reward-based RL post-training (2024), then from training-time-only improvement to test-time compute scaling (2025). Simultaneously, the scope expanded dramatically, from short single-clip generation to multi-scene narrative films and physically-grounded world simulation for robotics and autonomous driving.
Sub-topics
Reward-Based Post-Training & Alignment
13 papers
Methods that use reinforcement learning, reward models, and human preference optimization to align video diffusion models with quality, motion, and text-adherence objectives after initial pre-training. This is the largest and most active sub-topic.
Efficient & Few-Step Video Generation
6 papers
Distillation and adversarial training techniques that reduce the number of sampling steps from 50+ to 1–4 steps, enabling real-time or near-real-time video generation without prohibitive quality loss.
Long-Form & Multi-Scene Narrative Generation
7 papers
Approaches for generating coherent multi-shot, multi-scene videos with consistent characters and narrative structure, extending video generation beyond single-clip outputs to minutes-long storytelling.
Video World Models & Embodied AI
8 papers
Using video generation as physics-aware world simulators for robotics planning, autonomous driving, and embodied agents, where generated videos must be physically consistent and action-conditioned.
Video Personalization & Identity Preservation
5 papers
Techniques for generating videos featuring specific identities, styles, or dynamic concepts from reference images or short clips, without expensive per-subject test-time optimization.
Audio-Visual Joint Generation
3 papers
Unified frameworks that generate synchronized audio and video simultaneously, including speech with lip-sync, sound effects, and music aligned to visual content.
Human Motion & Avatar Animation
4 papers
Generating realistic human body motion, co-speech gestures, sign language, and unified multi-task avatar animation from text, audio, or multimodal inputs.
Test-Time Compute Scaling
2 papers
Methods that allocate additional computation at inference time through search, evolutionary algorithms, or tree-based exploration to improve video quality without retraining.
Large-Scale Foundation Models & Architectures
5 papers
Scaling video generation models to tens of billions of parameters with novel architectures including flow matching transformers, autoregressive token-based approaches, and unified multimodal systems.
Evaluation Benchmarks & Quality Assessment
2 papers
New benchmarks and evaluation methodologies for assessing video generation quality, including object state changes, temporal hallucinations, and physical plausibility.
Key Insights
- GRPO-based RL post-training improves video motion quality by up to 181% over prior RL methods
- Adversarial post-training enables one-step video generation that surpasses multi-step diffusion quality
- Test-time evolutionary search lets small models match 10× larger models without retraining
- Physical plausibility remains a fundamental gap despite achieving high visual fidelity scores
- Multi-scene narrative video now spans 20+ coherent shots and 3 minutes of consistent content
Timeline
Research has converged on GRPO-based reinforcement learning as the dominant post-training paradigm while branching into two complementary directions: ultra-fast generation via adversarial one-step methods and ultra-high-quality generation via test-time scaling. Increasingly, video generation is being applied beyond content creation to embodied AI, world modeling, and automated professional production.
- Control-A-Video (2023) introduced motion-adaptive noise priors and spatio-temporal reward feedback for controllable text-to-video generation
- AnimateDiff (2023) demonstrated plug-and-play motion modules that animate any personalized text-to-image model without model-specific tuning
- HiP (Compositional Foundation Models for Hierarchical Planning, 2023) pioneered iterative refinement across language, video, and action foundation models for long-horizon embodied planning
- Drive-WM (Driving into the Future, 2023) introduced the first multiview driving world model compatible with end-to-end planners, achieving 3.65 FID on nuScenes
- T2V-Turbo (2024) broke the consistency model quality bottleneck by integrating mixed spatial-temporal reward feedback during distillation, achieving >10× inference acceleration
- VADER (Video Diffusion Alignment via Reward Gradients, 2024) pioneered backpropagating differentiable reward gradients through video denoising on consumer hardware (16GB VRAM)
- Movie Gen (2024) scaled flow matching transformers to 30B parameters for 1080p HD video with integrated audio, editing, and personalization capabilities
- T2V-Turbo-v2 (2024) achieved a state-of-the-art 85.13 VBench Total Score by combining offline motion guidance with multi-reward consistency distillation, surpassing Gen-3 and Kling
- DOLLAR (2024) combined variational score and consistency distillation with latent reward models for 278.6× inference speedup
- Large Motion Model (2024) consolidated 16 motion datasets into the MotionVerse benchmark with 320k sequences for unified multi-task motion generation
Shift from supervised training to reward-based post-training: models began using RL and differentiable rewards to align video outputs with human preferences, moving beyond simple likelihood optimization.
- Diffusion Adversarial Post-Training (2025) achieved one-step 1280×720 video generation at real-time speed by training directly against real data with a 16B-parameter GAN
- DanceGRPO (2025) established the foundational GRPO framework for visual generation, outperforming DDPO/DPOK by up to 181% on motion quality
- LCT (2025) expanded single-shot models to generate coherent 20-shot, 3-minute narrative videos via attention window expansion
- Autoregressive Adversarial Post-Training (2025) enabled real-time 24fps interactive streaming of 1-minute consistent videos on a single H100
- RewardDance (2025) scaled generative reward models to 26B parameters with Chain-of-Thought reasoning, drastically reducing reward hacking
- EvoSearch (2025) introduced test-time evolutionary search where a 1.3B model matches performance of the 10× larger 14B model
- Video Alchemist (2025) achieved +23.2% subject similarity improvement in open-set multi-subject video personalization without test-time optimization
- Seedance 1.5 (Seedance 1.5 Pro, 2025) demonstrated native joint audio-visual generation with RLHF, achieving >10× inference speedup via multi-stage distillation
Group Relative Policy Optimization (GRPO) became the dominant RL paradigm for video generation post-training, while test-time compute scaling emerged as a complementary training-free approach to improve quality.
- PlayWorld (2026) demonstrated autonomous robot play for world model training, improving real-world policy success rates by 65%
- COMIC (2026) achieved fully automated sketch comedy production using multi-agent iterative competition with engagement-calibrated critics
- FlashMotion (2026) solved trajectory control in few-step distilled generators via three-stage hybrid adapter tuning (FID 14.35 in 4 steps)
- OSCBench (2026) introduced systematic evaluation of object state change understanding across 1,120 prompts and 6 SOTA models
- MaDiS (2026) achieved state-of-the-art sign language generation with masked diffusion, reducing inference latency by ~30% over autoregressive baselines
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Reward-Aligned RL Post-Training | Group Relative Policy Optimization (GRPO) uses group-based advantage estimation from reward models to stabilize RL training of visual generators without a separate value network. | DanceGRPO improves over DDPO/DPOK by +181% on VideoAlign motion quality benchmarks; T2V-Turbo-v2 achieves 85.13 VBench Total Score, surpassing Gen-3 (82.32) and Kling (81.85) | DanceGRPO (2025), RewardDance (2025), Video Diffusion Alignment via Reward... (2024), T2V-Turbo-v2 (2024), PhysCorr (2025) |
| Adversarial Post-Training & Fast Distillation | Adversarial Post-Training (APT) trains a generator directly against real data using a GAN objective, abandoning teacher-student distillation entirely for one-step video generation. | APT surpasses 25-step diffusion baseline by +32.3% in visual fidelity preference for one-step generation; AAPT achieves real-time 24fps at 736×416 on a single H100 GPU | Diffusion Adversarial Post-Training for One-Step... (2025), Autoregressive Adversarial Post-Training for Real-Time... (2025), DOLLAR (2024), FlashMotion (2026) |
| Test-Time Compute Scaling for Video | Reformulates video denoising as a search problem where evolutionary algorithms mutate and evolve latent states to discover high-quality generation paths. | EvoSearch with Wan 1.3B achieves competitive performance with the 10× larger Wan 14B model; Video-T1's Tree-of-Frames reduces scaling cost compared to random linear search | Scaling Image and Video Generation... (2025), Video-T1 (2025) |
| Long Context & Multi-Scene Narrative Generation | Long Context Tuning (LCT) expands the attention window of single-shot models to process all shots simultaneously with interleaved 3D positional embeddings and asynchronous diffusion timesteps. | LCT generates coherent 20-shot, 3-minute videos from single-shot models; InfLVG extends generation length by 9× over standard autoregressive baselines while maintaining consistency | Long Context Tuning (2025), Automated Movie Generation via Multi-Agent... (2025), InfLVG (2025), COMIC (2026) |
| Video World Models for Embodied Intelligence | World Model-based Policy Optimization (WMPO) replaces real-world RL rollouts with imagined video trajectories from a pixel-space diffusion backbone, enabling safe on-policy learning. | PlayWorld improves real-world robotic policy success rates by 65% over pre-trained policies; Drive-WM achieves 3.65 FID on nuScenes, outperforming DriveDreamer (5.21 FID) | WMPO (2025), PlayWorld (2026), Driving into the Future: Multiview... (2023), Reinforcement Learning with Inverse Rewards... (2025) |
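The test-time search idea in the table above (mutating and evolving latent states toward higher reward) can be illustrated with a generic elitist evolutionary loop. This is a simplified sketch under stated assumptions, not the EvoSearch algorithm itself: the latents are short lists of floats, `toy_reward` is a hypothetical stand-in for a video reward model, and the search runs over final latents rather than along a denoising trajectory.

```python
import random


def toy_reward(latent):
    # Hypothetical stand-in for a video reward model: higher is better,
    # with the maximum at a latent of all ones.
    return -sum((x - 1.0) ** 2 for x in latent)


def evolutionary_search(reward_fn, dim=4, population=8, generations=5,
                        keep=2, sigma=0.3, seed=0):
    """Keep the best candidates each generation and mutate them with noise."""
    rng = random.Random(seed)
    pop = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
    best_rewards = [max(reward_fn(z) for z in pop)]
    for _ in range(generations):
        # Selection: the top `keep` candidates survive unchanged.
        elites = sorted(pop, key=reward_fn, reverse=True)[:keep]
        # Mutation: refill the population with noisy copies of the elites.
        children = [
            [x + rng.gauss(0.0, sigma) for x in rng.choice(elites)]
            for _ in range(population - keep)
        ]
        pop = elites + children
        best_rewards.append(max(reward_fn(z) for z in pop))
    best = max(pop, key=reward_fn)
    return best, best_rewards


best_latent, best_rewards = evolutionary_search(toy_reward)
```

Because the elites are carried over unchanged, the best reward never decreases between generations. The extra cost is paid entirely at inference time, in reward and generator evaluations, with no change to model weights.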
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| VBench | Total Score (percentage, higher is better) | 85.13% | T2V-Turbo-v2 (2024) |
| VideoAlign Motion Quality | Motion Quality Score (relative improvement over baselines) | +181% over baselines | DanceGRPO (2025) |
| nuScenes Video Generation | FID (Fréchet Inception Distance, lower is better) | 3.65 FID | Driving into the Future: Multiview... (2023) |
| One-Step Video Generation Preference | Human Preference Rate (percentage preferring one-step over multi-step) | +32.3% preference over 25-step baseline | Diffusion Adversarial Post-Training for One-Step... (2025) |
Known Limitations (4)
- Physical plausibility violations: Generated videos frequently break fundamental physics laws (gravity, object permanence, fluid dynamics) despite impressive visual quality, limiting deployment in simulation and robotics domains (affects: Reward-Aligned RL Post-Training, Adversarial Post-Training & Fast Distillation)
  Potential fix: Physics-specific reward models (PhysicsRM) and synthetic physics datasets for targeted fine-tuning; PISA shows as few as 5,000 synthetic samples can teach pre-trained models specific physical behaviors like gravity
- Reward hacking and Goodhart's Law: Sustained RL optimization causes reward models to lose fidelity as quality proxies, with models exploiting shortcuts (improving one metric at the expense of others) and reward scores saturating within a few hundred training steps (affects: Reward-Aligned RL Post-Training)
  Potential fix: TaRoS dynamically rebalances reward components based on intra-group discriminative ability; RewardDance scales reward models to 26B parameters with CoT reasoning to maintain signal quality and reduce hacking
- Identity blending in multi-subject personalization: When multiple reference identities are provided, attributes from different subjects frequently merge into composite characters, especially for same-gender or same-race pairs, undermining practical personalization (affects: Long Context & Multi-Scene Narrative Generation, Reward-Aligned RL Post-Training)
  Potential fix: Anchored prompts with concept-specific embeddings explicitly link each reference image to its text entity (Movie Weaver); identity-preserving reward models trained on human preference data provide feedback for GRPO-based alignment (Identity-GRPO)
- Computational cost of RL post-training: GRPO-based methods require generating multiple candidate videos per prompt for advantage estimation, creating significant GPU memory and time overhead that limits scalability to larger prompt sets and higher resolutions (affects: Reward-Aligned RL Post-Training, Test-Time Compute Scaling for Video)
  Potential fix: Bayesian prior-guided optimization (BPGO) filters noisy reward signals to converge faster with fewer samples; trajectory alignment with memory banks (TAGRPO) avoids expensive re-generation by reusing past high/low reward trajectories
Major papers in this topic (10)
- DanceGRPO: Unleashing GRPO on Visual Generation (2025-05) 9
- Diffusion Adversarial Post-Training for One-Step Video Generation (2025-01) 9
- Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation (2025-06) 9
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training (2024-10) 9
- Movie Gen: A Cast of Media Foundation Models (2024-10) 9
- RewardDance: Reward Scaling in Visual Generation (2025-09) 9
- Scaling Image and Video Generation via Test-Time Evolutionary Search (2025-05) 8
- Long Context Tuning (2025-03) 8
- PlayWorld: Learning Robot World Models from Autonomous Play (2026-03) 8
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning (2023-07) 8
Within the same paradigm, another important research direction focuses on Image Editing.
Image Editing
What: Image editing research develops methods to modify existing images based on text instructions, reference images, or user intent while preserving unedited content and identity.
Why: Enabling intuitive visual content manipulation empowers creators and everyday users to realize creative visions without tedious manual pixel-level work.
Baseline: Standard diffusion models generate images from text prompts but lack precise control over local edits, identity preservation, and complex multi-attribute modifications.
- Preserving unedited regions and subject identity while applying targeted semantic modifications to specific areas
- Handling complex multi-object instructions with correct attribute binding, spatial reasoning, and style consistency
- Building reliable automated reward signals to train and evaluate editing quality at scale
Running Example
Baseline: A standard diffusion model would regenerate the entire image from the text description, losing the dog's specific identity, fur pattern, and producing inconsistent edits across regions.
Challenge: This example requires three simultaneous capabilities: (1) preserving the dog's identity (personalization), (2) changing pose and background independently (disentangled editing), and (3) ensuring spatial coherence between foreground subject and new background (compositional reasoning).
Overall Progress
Image editing research has evolved from parameter-heavy per-subject fine-tuning (2023) through training-free personalization and LLM-driven compositional reasoning (2024) to unified multimodal frameworks that jointly handle generation, editing, and composition with RL-optimized reward signals (2025–2026). A key paradigm shift was the integration of reward models as first-class components, enabling automated quality evaluation that rivals human experts. The field has converged toward systems where reasoning, generation, and self-correction operate as coordinated agents rather than isolated pipelines.
Sub-topics
Reward-Driven Image Editing
4 papers
Methods that use reward models, reinforcement learning, and preference learning to improve instruction following, detail preservation, and generation quality in image editing systems.
Subject & Style Personalization
9 papers
Techniques for generating and editing images that preserve specific subject identity, artistic style, or visual attributes from reference images, including both fine-tuning-based and training-free approaches.
Reasoning-Guided & Interactive Editing
4 papers
Approaches that leverage multimodal LLM reasoning, chain-of-thought planning, and interactive user interfaces to handle complex compositional edits and improve prompt engineering for image generation.
Architecture Adaptation & Efficient Editing
5 papers
Research on adapting modern diffusion transformer architectures (MM-DiT) for editing tasks, accelerating inference through sparse parameterization, and building unified generation-editing frameworks.
Image Restoration & Enhancement
4 papers
Diffusion-based methods for super-resolution, denoising, medical image reconstruction, and coarse-to-fine visual refinement that restore or enhance image quality under degradation.
Adversarial Robustness of Diffusion Models
1 paper
Research on understanding and exploiting vulnerabilities in text-to-image diffusion models through multi-modal adversarial attacks that manipulate generated content.
Key Insights
- Decomposed reward signals dramatically improve instruction following and detail preservation in editing
- Training-free personalization now matches or exceeds fine-tuning-based methods in identity fidelity
- LLM chain-of-thought reasoning unlocks compositional generation that direct text-to-image mapping cannot achieve
- MM-DiT attention heads naturally specialize for different semantics, enabling targeted editing without retraining
- Unified generation-editing frameworks outperform separate specialized models on both tasks simultaneously
Timeline
Research has progressively moved from isolated editing capabilities toward unified, reward-driven, and reasoning-guided systems. Early work focused on efficient personalization; mid-period work introduced compositional planning via LLMs; recent work emphasizes self-improving agentic systems with internalized reward models that enable continuous quality improvement.
- SVDiff (2023) introduced spectral shift fine-tuning, reducing personalization checkpoints from 3.66GB to 1.7MB while improving multi-subject disentanglement
- HiPer (Highly Personalized Text Embedding for..., 2023) demonstrated that decomposing text embeddings into semantic head and personalized tail enables single-image personalization in 3 minutes
- ProSpect (2023) discovered that diffusion denoising stages correspond to visual attributes in frequency order (layout → content → style)
- Null-text cartoonization (Null-text Guidance is Secretly a..., 2023) revealed that perturbing the null-text branch in Classifier-Free Guidance produces cartoon stylization without any training
- Mastering Text-to-Image Diffusion (2024) pioneered using MLLM chain-of-thought reasoning as a global planner for compositional generation with Complementary Regional Diffusion
- JeDi (Joint-Image, 2024) eliminated fine-tuning entirely by learning joint image distributions with coupled self-attention, outperforming even DreamBooth
- Multi-Reward (2024) introduced quality-aware conditioning that decomposes reward into instruction following, detail preserving, and generation quality
- HeadRouter (2024) discovered semantic specialization of attention heads in MM-DiTs, enabling training-free text-guided editing on next-generation architectures
Shift from single-concept fine-tuning toward training-free personalization and LLM-driven compositional planning for complex multi-object edits.
- Seedream 4.0 (2025) unified T2I synthesis, editing, and multi-image composition in a single framework, ranking 1st on Artificial Analysis Arena and outperforming GPT-Image-1
- EditScore (2025) established a rigorous reward benchmark and fine-tuned VLM-based reward models that surpass GPT-4o/5, enabling stable online RL for editing
- GoT (Generation Chain-of-Thought, 2025) introduced explicit language reasoning before pixel generation with a Semantic-Spatial Guidance Module for unified generation and editing
- Sparse-LaViDa (2025) achieved 2.83× speedup on editing tasks while improving accuracy through sparse token processing with step-causal masking and KV-caching
Transition from separate generation and editing pipelines to unified multimodal systems that jointly handle T2I, editing, and composition within a single model.
- Joint Reward Modeling (Internalizing Chain-of-Thought for Efficient Visual..., 2026) introduced Latent CoT that internalizes generative reasoning into efficient discriminative scoring, achieving 85.1% on EditReward-Bench and surpassing GPT-5 by 9.6%
- SIDiffAgent (Self-Improving, 2026) proposed Theory-of-Mind inspired self-improving agents with experience-driven memory, improving GenAIBench VQA Score by +8.73% over prior agentic systems
- QUSR (Quality-Aware, 2026) leveraged VLM-generated quality descriptions and pixel-wise uncertainty maps to achieve state-of-the-art restoration, reducing FID by 16.74 on DRealSR
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Reward-Conditioned Editing | Decompose editing quality into explicit reward dimensions (instruction following, detail preservation, generation quality) and condition or optimize the editor against them. | EditScore-72B achieves 86.36% accuracy on EditReward-Bench, surpassing GPT-4o (84.41%) and GPT-5 (85.29%); MRC improves InsPix2Pix by +9.4% Instruction Following on Real-Edit. | EditScore (2025), Joint Reward Modeling (2026), Multi-Reward (2024), OneReward (2025) |
| Training-Free Personalization | Inject reference image features into the diffusion process at inference time through attention manipulation or joint distribution modeling, eliminating the need for per-subject training. | JeDi outperforms fine-tuning-based DreamBooth in CLIP-I and DINO subject fidelity scores on the DreamBooth dataset while requiring zero test-time optimization. | JeDi (2024), RB-Modulation (2024), FreeTuner (2024) |
| Reasoning-Guided Visual Generation | Use multimodal LLMs as planners that reason about spatial relationships and semantic structure before delegating sub-regions or editing steps to diffusion models. | SIDiffAgent achieves +8.73% VQA Score on GenAIBench over T2I-Copilot and +5.36% over proprietary Imagen 3. | Mastering Text-to-Image Diffusion (2024), GoT (2025), SIDiffAgent (2026) |
| MM-DiT Attention Manipulation | Decompose MM-DiT's joint attention into functional blocks and selectively modify image input projections or route text guidance to semantically sensitive heads for localized edits. | Input projection editing achieves robust editing across 5 MM-DiT variants (SD3, SD3.5, Flux.1) while maintaining inference speed within 2% of standard generation (15.2s vs 14.9s). | HeadRouter (2024), Exploring Multimodal Diffusion Transformers for... (2025) |
| Parameter-Efficient Diffusion Personalization | Fine-tune minimal parameter subsets (spectral shifts, embedding tails, prompt spectra, or decoupled identity modules) to capture subject identity without full model retraining. | SVDiff reduces checkpoint size to ~1.7MB per subject (vs. 3.66GB for DreamBooth, a ~2,200× reduction) while achieving 60.9% user preference over full-weight fine-tuning for multi-subject generation. | SVDiff (2023), Highly Personalized Text Embedding for... (2023), ProSpect (2023), Infinite-ID (2024), PALP (2024) |
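As a quick sanity check on the checkpoint-size figure quoted for SVDiff in the table above, the ratio can be recomputed from the two sizes given. The variable names are illustrative, and the unit conventions (decimal versus binary gigabytes) are assumptions, since the source does not say which it uses.

```python
svdiff_checkpoint_mb = 1.7   # per-subject SVDiff checkpoint, as quoted
full_checkpoint_gb = 3.66    # full DreamBooth checkpoint, as quoted

# Assumption: decimal units, 1 GB = 1000 MB.
ratio_decimal = full_checkpoint_gb * 1000 / svdiff_checkpoint_mb
# Assumption: binary units, 1 GB = 1024 MB.
ratio_binary = full_checkpoint_gb * 1024 / svdiff_checkpoint_mb

print(f"decimal: {ratio_decimal:.0f}x, binary: {ratio_binary:.0f}x")
```

Under either convention the result is in the low 2,000s, consistent with the roughly 2,200-fold reduction stated in the table.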
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| EditReward-Bench | Accuracy (%) | 86.36% | EditScore (2025) |
| GEdit-Bench | Editing Success Rate (%) | +14.6% improvement over base OmniGen2 | EditScore (2025) |
| GenAIBench | VQA Score | +8.73% over T2I-Copilot | SIDiffAgent (2026) |
| DRealSR | FID (Fréchet Inception Distance, lower is better) | State-of-the-art across all metrics | QUSR (2026) |
| Artificial Analysis Arena | Arena Ranking | Rank 1st in both single-image editing and T2I tracks | Seedream 4.0 (2025) |
⚠️ Known Limitations (4)
- Reward model accuracy ceiling: even the best reward models (86.36%) disagree with human experts ~14% of the time, potentially misguiding RL optimization toward non-human-aligned outputs (affects: Reward-Conditioned Editing)
Potential fix: Joint training of discriminative and generative reward objectives (Latent CoT) improves reasoning capabilities; self-ensembling reduces variance in reward estimates
- Identity-text entanglement: most methods face a trade-off between preserving subject identity fidelity and adhering to complex text prompts describing new contexts or styles (affects: Training-Free Personalization, Parameter-Efficient Diffusion Personalization)
Potential fix: Explicit ID-semantics decoupling via separate attention modules (Infinite-ID) or score distillation to prevent prompt forgetting (PALP)
- Computational cost of reasoning-guided methods: LLM-based planning adds significant latency (multiple LLM inference calls per image) and may not scale to real-time interactive editing applications (affects: Reasoning-Guided Visual Generation)
Potential fix: Experience-driven memory caching (SIDiffAgent) reduces redundant planning; adversarial acceleration (Seedream 4.0) enables 1.4s generation at 2K resolution
- Limited generalization of attention manipulation to new architectures: methods designed for specific MM-DiT variants may not transfer to future architectures with different attention patterns (affects: MM-DiT Attention Manipulation)
Potential fix: Instance-adaptive routing (HeadRouter) and block selection strategies generalize across 5+ MM-DiT variants without model-specific tuning, suggesting architecture-agnostic principles may exist
View major papers in this topic (10)
- Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models (2026-02) 9
- EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling (2025-09) 9
- Seedream 4.0: Toward Next-generation Multimodal Image Generation (2025-09) 9
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation (2024-06) 8
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (2024-01) 8
- SVDiff: Compact Parameter Space for Diffusion Fine-Tuning (2023-03) 8
- Sparse-LaViDa: Efficient Masked Discrete Diffusion for Multimodal Generation and Understanding (2025-01) 8
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing (2025-03) 8
- SIDiffAgent: Self-Improving Diffusion Agent (2026-02) 8
- QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model (2026-03) 8
💡 Within the same paradigm, another important research direction focuses on Unified Understanding and Generation.
Unified Understanding and Generation
What: Research on building single models that jointly understand and generate content across multiple modalities (text, image, audio) within a shared architecture.
Why: Separate understanding and generation models are inefficient, miss cross-modal synergies, and cannot leverage comprehension to improve generation quality.
Baseline: Pipeline approaches using separate specialized models for understanding (e.g., LLaVA for vision-language) and generation (e.g., Stable Diffusion for images).
- Bridging the cognitive gap between understanding and generation within shared model parameters
- Preventing task conflict and quality degradation when jointly training on multiple modalities
- Maintaining generation quality and coherence in long interleaved multi-modal sequences
🧪 Running Example
Baseline: A pipeline approach would use a vision model to caption the fox photo and a separate text-to-image model to generate the scene, but the generated fox would lose its distinctive features and the scene composition would ignore visual context from the reference photo.
Challenge: This example requires understanding the reference image (fox appearance), reasoning about the text description (scene layout, moonlight), and generating a coherent image, all within one model. It also illustrates the cognitive gap: the model may 'understand' the fox but fail to translate that understanding into generation-friendly features.
Overall Progress
The field has progressed from separate understanding and generation models to unified architectures that handle both within shared parameters. Key paradigm shifts include the move from modality-specific designs to modality-agnostic diffusion and autoregressive frameworks, and from direct generation to reasoning-then-generating approaches. Recent work addresses frontier challenges like long-horizon coherence and speech integration, pushing toward truly omni-modal systems.
Sub-topics
Unified Model Architectures
5 papers
Architectural approaches for building single models that handle both understanding and generation, including autoregressive, diffusion-based, and flow-based designs with strategies to prevent task conflict in shared parameters.
Reasoning-Enhanced Generation
4 papers
Methods that introduce explicit reasoning steps (chain-of-thought, self-evaluation, reinforcement learning) before or during generation to improve instruction adherence, subject fidelity, and compositional accuracy.
Interleaved and Omni-Modal Generation
2 papers
Research on generating long interleaved text-image sequences and integrating additional modalities such as speech, addressing quality collapse in extended generation and efficient multi-modal fusion.
💡 Key Insights
💡 Explicit reasoning before generation improves instruction adherence by 89–160%
💡 Visual history actively pollutes long-horizon generation after ~20 discrete images
💡 Decoupled routing prevents understanding-generation task conflict in unified models
💡 Diffusion models can match autoregressive LLMs on reasoning with RL post-training
💡 Complementary text-image reasoning outperforms redundant cross-modal descriptions
Timeline
Research has evolved from foundational joint modeling (2023) through architectural innovations for multi-modal fusion (2024–2025) to reasoning-enhanced and reliability-focused generation (2025–2026), with increasing emphasis on explicit chain-of-thought and reinforcement learning to bridge the understanding-generation gap.
- MLD (Multi-Modal Latent Diffusion, 2023) pioneered decoupled encoding with shared diffusion modeling, resolving the coherence-quality tradeoff that plagued multi-modal VAEs by achieving 85.22% joint coherence
- Lumina-mGPT (2024) demonstrated that pure autoregressive decoder-only models can match diffusion model quality for photorealistic generation through Flexible Progressive Supervised Fine-tuning, training a versatile 7B model in just 7 days
Shift from modality-specific models to unified architectures that jointly model multiple modalities under shared probabilistic frameworks.
- OmniFlow (2024) extended rectified flows to any-to-any generation across text, image, and audio with novel multi-modal guidance and model merging for stable training
- Lyra (2024) integrated speech into multimodal LLMs through latent cross-modality regularization and dynamic token reduction, enabling multi-hour speech processing
- Mogao (2025) introduced decoupled QKV/FFN routing and Efficient Complete Teacher Forcing for causal interleaved multi-modal generation, achieving 83.3% on MME perception
- MMaDA (2025) achieved the first fully modality-agnostic diffusion model with UniGRPO reinforcement learning, outperforming autoregressive LLMs on reasoning benchmarks while excelling at image generation
- MM-R1 (2025) applied cross-modal chain-of-thought with GRPO reinforcement learning for zero-shot personalized generation without subject-specific fine-tuning
- ImageGen-CoT (2025) introduced structured reasoning before image generation with hybrid scaling, improving CoBSAT scores by 89% and DreamBench++ by 114%
- ThinkMorph (2025) established complementary interleaved reasoning where text and image thoughts advance problem-solving synergistically, enabling a 7B model to surpass 38B models on spatial reasoning
- SEER (Endogenous Reprompting, 2026) developed self-evolving cognitive alignment using only 300 seed samples, proving that optimizing reasoning outperforms optimizing pixel-level execution
- UniLongGen (2026) identified the event bottleneck in long-horizon generation (quality collapses after ~20 visual events) and proposed training-free layer-split visibility to sustain generation fidelity
Shift from direct generation to reasoning-then-generating paradigms where models explicitly plan and reason before producing visual output.
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Autoregressive Multimodal Pretraining | Initialize from strong multimodal bases and use decoupled routing or progressive fine-tuning to prevent understanding-generation task conflict. | Mogao achieves 83.3% on MME perception, surpassing Emu2 by +5.0pp (78.3%) and Mantis-8B by +2.7pp (80.6%); Lumina-mGPT matches SD3 and DALL-E 3 image quality using a pure autoregressive approach. | Lumina-mGPT (2024), Mogao (2025) |
| Unified Diffusion-Based Generation | Model all modalities as discrete tokens or latent vectors under a unified denoising process with modality-agnostic architecture. | MLD achieves 85.22% joint coherence on MNIST-SVHN, improving over MVTCAE by +36pp; OmniFlow achieves 1.79 FAD (Fréchet Audio Distance) for audio, improving over AudioMAE baseline of 2.03; MMaDA surpasses LLaMA-3-7B on GSM8K and MATH reasoning despite being a diffusion model. | Multi-Modal (2023), OmniFlow (2024), MMaDA (2025) |
| Chain-of-Thought Image Generation | Generate structured textual reasoning (chain-of-thought) prior to image synthesis, with complementary text-image thoughts advancing reasoning synergistically. | ImageGen-CoT improves SEED-X by +89% on CoBSAT, achieving 0.909 (from 0.349 baseline), and +114% on DreamBench++, achieving 0.543 (from 0.188); ThinkMorph achieves +85.84% accuracy on Spatial Navigation (VSP), surpassing InternVL3.5-38B on SAT reasoning (52.67% vs 49.33%); SEER outperforms Emu3 and Janus-Pro in instruction adherence using only 300 seed samples. | ImageGen-CoT (2025), MM-R1 (2025), ThinkMorph (2025), Endogenous Reprompting (2026) |
| Long-Horizon Context Curation | Generation fails based on discrete visual event count (~20 images), not token length; layer-split attention separates text grounding from image synthesis. | Demonstrates that 150k text tokens maintain high fidelity while 150k image tokens (~30 images) cause total collapse; UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency. | How Long Can Unified Multimodal... (2026) |
| Speech-Centric Omni-Cognition | Align speech tokens to text transcript embeddings in latent space and dynamically prune redundant tokens via attention-based similarity. | Achieves state-of-the-art across vision-language, vision-speech, and speech-language benchmarks compared to other omni-methods; compresses long speech to 300 tokens per segment for multi-hour processing. | Lyra (2024) |
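The decoupled routing idea in the Autoregressive Multimodal Pretraining row can be illustrated with a toy sketch. This is not Mogao's implementation: the function names, the two-modality setup, and the tiny weight matrices are all hypothetical. It shows only the core routing step, where each token in a shared interleaved sequence is projected with the weights belonging to its own modality, so understanding and generation pathways do not share those parameters. Attention over the shared sequence is omitted.

```python
def linear(weight, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def routed_projection(tokens, modalities, weights_by_modality):
    """Project each token with the weight matrix of its own modality.

    tokens: list of feature vectors in one interleaved sequence.
    modalities: list of modality tags, one per token (e.g. "text", "image").
    weights_by_modality: dict mapping a tag to its own weight matrix.
    """
    return [
        linear(weights_by_modality[mod], tok)
        for tok, mod in zip(tokens, modalities)
    ]


# Hypothetical 2-dimensional example: text and image tokens share one
# sequence but never share projection weights.
weights = {
    "text": [[1.0, 0.0], [0.0, 1.0]],   # identity for text tokens
    "image": [[2.0, 0.0], [0.0, 2.0]],  # a different map for image tokens
}
sequence = [[1.0, 2.0], [3.0, 4.0]]
tags = ["text", "image"]
projected = routed_projection(sequence, tags, weights)
```

In a real transformer block the same per-modality selection would apply to the query, key, value, and feed-forward projections, while all tokens would still attend to one another.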
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MME Perception | Accuracy (%) | 83.3% | Mogao (2025) |
| CoBSAT | Accuracy Score (0–1) | 0.909 | ImageGen-CoT (2025) |
| DreamBench++ | Score (0–1) | 0.543 | ImageGen-CoT (2025) |
| MNIST-SVHN Joint Coherence | Joint Coherence (%) | 85.22% | Multi-Modal (2023) |
| SAT Spatial Reasoning | Accuracy (%) | 52.67% | ThinkMorph (2025) |
⚠️ Known Limitations (4)
- Quality collapse in long interleaved sequences: accumulated visual tokens hijack attention, limiting practical applications like storybook or document generation beyond ~20 images (affects: Autoregressive Multimodal Pretraining, Chain-of-Thought Image Generation)
Potential fix: Layer-split visibility policies and context curation strategies that separate text grounding from image synthesis at different transformer layers (UniLongGen)
- Cognitive gap between understanding and generation: models comprehend visual instructions but fail to translate that understanding into generator-friendly representations, causing instruction-following failures (affects: Autoregressive Multimodal Pretraining, Unified Diffusion-Based Generation)
Potential fix: Self-evolving reprompting mechanisms (SEER) that train models to generate self-aligned descriptors, optimizing reasoning prompts rather than pixel-level execution
- Task conflict in shared parameters: joint training on understanding and generation degrades performance on one or both tasks due to gradient interference and competing optimization objectives (affects: Autoregressive Multimodal Pretraining, Unified Diffusion-Based Generation)
Potential fix: Decoupled QKV/FFN routing for separate task pathways (Mogao) or fully modality-agnostic architectures with task-specific RL fine-tuning (MMaDA)
- Limited modality coverage: most unified models handle only text and images, with speech, audio, and video integration remaining underexplored and computationally expensive (affects: Autoregressive Multimodal Pretraining, Chain-of-Thought Image Generation)
Potential fix: Latent cross-modality regularization for speech alignment (Lyra) and multi-modal rectified flows for any-to-any generation (OmniFlow) that extend joint modeling to speech and audio
View major papers in this topic (10)
- Multi-Modal Latent Diffusion (2023-06) 8
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (2024-08) 8
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (2024-12) 8
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025-05) 8
- MMaDA: Multimodal Large Diffusion Language Models (2025-06) 8
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning (2025-03) 8
- ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning (2025-10) 8
- Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models (2026-01) 8
- How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation (2026-03) 8
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (2024-12) 7
💡 Moving to the next paradigm, we turn to Other MM Topics.
Other MM Topics
What: A broad collection of multimodal research spanning RL-based reasoning for MLLMs, safety/adversarial robustness, audio-language models, medical imaging, fusion under missing modalities, benchmarks, and retrieval-augmented generation.
Why: Multimodal AI must integrate diverse signals (vision, audio, text, sensors) to build robust, safe, and generalizable systems for real-world deployment across domains.
Baseline: Standard MLLMs use post-hoc vision encoder adaptation onto frozen text LLMs, with supervised fine-tuning on curated image-text pairs for downstream tasks.
- Modality imbalance and interference degrade fusion when one signal dominates or is missing entirely
- Safety alignment fails across modalities, enabling jailbreaks via images, audio, or typography
- RL-based reasoning for multimodal models suffers from sparse rewards and overthinking on simple tasks
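The first challenge above (fusion when one modality is missing) is often handled with two simple ingredients: fusion weights renormalized over whichever modalities are present, and random modality dropout during training so the model sees incomplete inputs. The sketch below is a minimal illustration under those assumptions. The function names `fuse` and `drop_modalities`, the fixed scalar weights, and the toy feature vectors are hypothetical and are not taken from any paper cited in this section.

```python
import random


def fuse(features, weights):
    """Weighted average over the modalities that are present.

    features: dict mapping modality name to a feature vector, or None if missing.
    weights: dict mapping modality name to a non-negative fusion weight.
    Weights are renormalized over available modalities only.
    """
    available = {m: f for m, f in features.items() if f is not None}
    if not available:
        raise ValueError("at least one modality must be present")
    total = sum(weights[m] for m in available)
    dim = len(next(iter(available.values())))
    return [
        sum(weights[m] / total * f[i] for m, f in available.items())
        for i in range(dim)
    ]


def drop_modalities(features, p_drop, rng):
    """Training-time augmentation: hide each modality with probability p_drop."""
    present = [m for m, f in features.items() if f is not None]
    kept = {
        m: (f if f is not None and rng.random() >= p_drop else None)
        for m, f in features.items()
    }
    # Never drop everything: restore one modality that was originally present.
    if present and all(v is None for v in kept.values()):
        survivor = rng.choice(present)
        kept[survivor] = features[survivor]
    return kept


feats = {"image": [1.0, 3.0], "text": [3.0, 1.0]}
w = {"image": 1.0, "text": 1.0}
full = fuse(feats, w)                                   # both modalities present
image_only = fuse({"image": [1.0, 3.0], "text": None}, w)  # text missing
augmented = drop_modalities(feats, p_drop=0.5, rng=random.Random(0))
```

Learned or input-dependent weights would replace the fixed scalars in practice; the renormalization step is what keeps the fused vector on the same scale when an input is absent.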
🧪 Running Example
Baseline: A standard MLLM processes the image at low resolution, ignores the prior scan, and generates a generic description with potential hallucinations about specific pathologies.
Challenge: This example requires longitudinal reasoning (comparing two images over time), fine-grained medical perception (detecting subtle changes), robust fusion of visual and clinical text, and safety alignment (avoiding harmful misdiagnoses).
Overall Progress
The field has undergone two major paradigm shifts: from post-hoc modality adaptation to native multimodal training (2023), and from supervised fine-tuning to RL-based reasoning (2025). Foundation models now process interleaved text, images, audio, and video natively, while GRPO-based RL has become the dominant post-training method, enabling smaller models (7-9B) to match or surpass much larger ones (72B+). Safety research has evolved from isolated vulnerability studies to comprehensive three-dimensional evaluation frameworks.
Sub-topics
RL-Based Multimodal Reasoning
65 papers
Methods applying reinforcement learning, particularly Group Relative Policy Optimization (GRPO), to enhance chain-of-thought reasoning, structured output, and generalization in multimodal LLMs across vision, video, audio, and time-series tasks.
Multimodal Safety, Attacks & Evaluation
40 papers
Research on jailbreak attacks targeting multimodal inputs (images, audio, typography), safety benchmarks for MLLMs, and adversarial robustness evaluation including backdoor attacks on model merging.
Multimodal Foundation Models & Benchmarks
80 papers
Large-scale natively multimodal models trained jointly on text, images, audio, and video, alongside frontier evaluation benchmarks that probe integrated capabilities and reasoning depth.
Audio-Language Understanding & Reasoning
30 papers
Models and benchmarks for audio question answering, sound reasoning via chain-of-thought, and addressing textual bias in large audio-language models across speech, music, and environmental sounds.
Multimodal Medical AI
55 papers
Foundation models and domain-adapted systems for medical image segmentation, diagnosis, report generation, and survival prediction across modalities including CT, MRI, PET, X-ray, and histopathology.
Multimodal Fusion & Missing Modality Robustness
60 papers
Methods addressing modality imbalance, dynamic fusion weighting, and robust inference when one or more modalities are absent at test time, spanning federated learning, segmentation, and recommendation.
Multimodal RAG & Retrieval
35 papers
Retrieval-augmented generation systems that integrate visual, textual, and document-level information for knowledge-intensive multimodal tasks, including poisoning attacks and efficiency optimizations.
💡 Key Insights
💡 RL-based reasoning enables 7-9B models to match or surpass 72B+ parameter models on complex tasks.
💡 Vision-language safety alignment is fundamentally fragile: typography and voice attacks bypass text-based filters.
💡 Native multimodal pre-training outperforms post-hoc adaptation across all capability dimensions.
💡 Audio models exhibit severe textual bias, dropping to near-0% accuracy when given misleading text.
💡 Medical foundation models reduce expert annotation time by 80%+ while matching specialist performance.
Timeline
Research has progressed from building individual multimodal models to creating agentic systems that reason, plan, and use tools across modalities, with an increasing emphasis on verifiable reasoning, deployment efficiency, and holistic safety evaluation.
- Gemini (2023) introduced natively multimodal joint training, scoring 90.04% on MMLU, the first model to exceed human experts.
- MedSAM (Segment Anything in Medical Images, 2023) adapted SAM to medical imaging with 1.5M image-mask pairs across 10 modalities, reducing annotation time by 82%.
- ShareGPT4V (2023) demonstrated that 1.2M high-quality descriptive captions dramatically improve MLLM alignment, gaining +36.1 points on MME.
- MM-SafetyBench (2023) revealed that embedding harmful text as images bypasses safety filters, raising ASR by 30%+.
- MAWS (2023) showed that self-supervised initialization before weakly supervised training scales to billions of images.
Shift from post-hoc modality adaptation to natively multimodal joint training, exemplified by Gemini's interleaved training paradigm.
- MathVerse (2024) exposed that MLLMs treat diagrams as distractions rather than information, systematically isolating visual understanding gaps.
- MM-Vet (2024) pioneered capability integration evaluation, testing 16 combinations of six core vision-language skills.
- BadMerging (2024) revealed that model merging introduces novel backdoor vulnerabilities with >90% attack success rate.
- EMAGE (2023) introduced masked audio-gesture modeling with compositional quantization for holistic co-speech generation.
- Time-MMD (2024) established the first diverse multi-domain multimodal time series dataset with LLM-curated text alignment.
- Kimi k1.5 (2025) scaled RL context to 128k tokens, matching o1 on AIME without value functions or process reward models.
- InternVL3 (2025) achieved 72.2 on MMMU with native multimodal pre-training, competitive with top proprietary models.
- MobileRL (2025) achieved 80.2% on AndroidWorld with difficulty-adaptive GRPO, surpassing 72B-parameter models with just 9B.
- Kimi K2.5 (2026) introduced Agent Swarm parallel orchestration, reducing latency by 4.5x for agentic tasks.
- Humanity's Last Exam (HLE, 2025) revealed that frontier models achieve <15% accuracy on expert-level questions, exposing massive reasoning gaps.
- AHELM (2025) established the first holistic audio-language evaluation covering 10 diverse aspects including fairness and safety.
Shift from supervised fine-tuning to reinforcement learning as the primary post-training paradigm for multimodal models, with GRPO becoming the dominant optimization method.
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| RL-Based Multimodal Reasoning | Use outcome-based RL rewards to incentivize step-by-step reasoning in multimodal models, bypassing the need for annotated reasoning traces. | Improves on standard SFT by +22.1% accuracy on ScreenSpot (UI-R1) and matches 32B model performance with 8B parameters (ContextRL); MobileRL-9B achieves 80.2% on AndroidWorld vs 64.2% prior SOTA. | Kimi k1.5 (2025), SophiaVL-R1 (2025), MobileRL (2025), ContextRL (2026) |
| Multimodal Safety Attack & Defense | Exploit the vision-language connector as a weak point in safety alignment by encoding harmful content in images, audio, or typographic text. | VoiceJailbreak increases GPT-4o attack success rate from 0.033 to 0.778; Typography-based attacks raise ASR on LLaVA by 30%+ over text-only baselines; BadMerging achieves >90% ASR where prior methods fail at <20%. | Voice Jailbreak Attacks Against GPT-4o (2024), MM-SafetyBench (2023), BadMerging (2024), OmniSafeBench-MM (2025) |
| Native Multimodal Pre-training | Jointly acquire visual and linguistic capabilities during a single pre-training stage rather than retrofitting a text-only LLM with a vision encoder. | InternVL3-78B achieves 72.2 on MMMU, setting SOTA for open-source MLLMs; Gemini Ultra scores 90.04% on MMLU, first to exceed human-expert performance (89.8%); Kimi K2.5 achieves 86.4% on GPQA-Diamond. | Gemini (2023), InternVL3 (2025), Kimi K2.5 (2026), ShareGPT4V (2023) |
| Audio-Language Reasoning Models | Apply structured reasoning frameworks (CoT, GRPO) to audio inputs, training models to plan, caption, and reason before answering complex audio questions. | SARI achieves 67.08% on MMAU, +16.35% over Qwen2-Audio base; Omni-R1 reaches 71.3% MMAU SOTA; Audio Flamingo Sound-CoT achieves 79.83% on MMAU-Sound vs GPT-4o Audio at 63.20%. | SARI (2025), Audio Flamingo 2 (2025), Audio Flamingo Sound-CoT Technical Report (2025), AHELM (2025) |
| Universal Medical Multimodal Models | Fine-tune or adapt general-purpose foundation models on large-scale curated medical datasets to enable cross-modality and cross-task generalization with prompt-based interfaces. | MedSAM outperforms specialist U-Net by 15.5% on unseen nasopharynx cancer segmentation; PRISM fine-tuned on 10% data outperforms supervised baselines using 100% data; MedRAX surpasses GPT-4o alone on complex clinical reasoning. | Segment Anything in Medical Images (2023), PRISM (2024), MedRAX (2025), PathMem (2026) |
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MMLU | Accuracy (%) | 90.04% | Gemini (2023) |
| MMMU | Accuracy (%) | 72.2% | InternVL3 (2025) |
| AndroidWorld | Success Rate (%) | 80.2% | MobileRL (2025) |
| MMAU (Multimodal Audio Understanding) | Accuracy (%) | 71.3% | Omni-R1 (2025) |
| MathVista | Accuracy (%) | 71.3% | SophiaVL-R1 (2025) |
⚠️ Known Limitations (4)
- RL training instability with multimodal data: GRPO suffers from sparse rewards and advantage vanishing on long-chain reasoning, especially with visual inputs where reward verification is harder than text. (affects: RL-Based Multimodal Reasoning (GRPO for MLLMs))
Potential fix: Difficulty-aware reward shaping, curriculum learning from easy to hard samples, and multi-turn sampling with mistake feedback (as in ContextRL).
- Cross-modal safety alignment gap: Safety training on text does not transfer to visual or audio channels, leaving multimodal models fundamentally vulnerable to modality-specific attacks. (affects: Multimodal Safety Attack & Defense, Native Multimodal Pre-training)
Potential fix: Joint multimodal safety RL, adversarial training with cross-modal attacks, and three-dimensional evaluation frameworks that assess harmfulness beyond binary ASR.
- Missing modality fragility: Most multimodal models degrade catastrophically when any input modality is absent at test time, despite real-world sensor failures being common. (affects: Universal Medical Multimodal Models, Audio-Language Reasoning Models)
Potential fix: Shared-specific feature decomposition, missing-modality augmentation during training, and federated pseudo-modality generation from frequency domain properties.
- Benchmark saturation and evaluation gaps: Frontier models exceed 90% on popular benchmarks like MMLU, making it difficult to measure progress, while emerging benchmarks reveal <15% accuracy on truly hard problems. (affects: Native Multimodal Pre-training)
Potential fix: Expert-sourced frontier benchmarks with negative LLM filtering, process-level evaluation (PSAS), and cognitive-developmental benchmarks that probe core reasoning abilities.
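The advantage-vanishing problem named in the first limitation above follows directly from how group-relative advantages are computed. The sketch below is a minimal illustration, assuming the commonly described GRPO normalization (each sampled response's reward minus the group mean, divided by the group standard deviation). The function name, the `eps` constant, and the use of the population standard deviation are assumptions; implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: the single correct response gets a positive advantage,
# the failures get negative ones, so the policy receives a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Sparse reward on a hard prompt: every sample fails, so every advantage
# is 0.0 and the group contributes no gradient at all.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The same happens on an easy prompt where every sample succeeds.
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

This is why the fixes listed above focus on curriculum and difficulty-aware sampling: they aim to keep prompts in the regime where a group contains both successes and failures.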
View major papers in this topic (10)
- Gemini: A Family of Highly Capable Multimodal Models (2023-12) 10
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (2025-04) 9
- Kimi K2.5: Visual Agentic Intelligence (2026-02) 9
- MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents (2025-09) 9
- Segment Anything in Medical Images (2023-04) 9
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2024-03) 9
- Humanity's Last Exam (2025-01) 9
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Agentic Capabilities (2025-07) 9
- BadMerging: Backdoor Attacks Against Model Merging (2024-08) 9
- AHELM: A Holistic Evaluation of Audio-Language Models (2025-09) 9
💡 Shifting from core paradigms to cross-cutting themes, we examine GUI and Web Agents.
GUI and Web Agents
What: Research on autonomous agents that perceive graphical user interfaces via vision-language models and execute multi-step tasks through mouse, keyboard, and touch interactions.
Why: Automating GUI-based workflows can dramatically reduce human effort on repetitive digital tasks across mobile, desktop, and web environments.
Baseline: Supervised fine-tuning on static human-annotated trajectories, where a VLM predicts the next action from a screenshot and text instruction.
- Binary reward signals provide no gradient for near-miss clicks, making precise spatial grounding difficult to learn
- Static offline training fails to generalize to dynamic, stochastic real-world interfaces that change across apps and updates
- Long-horizon tasks compound errors across dozens of steps, where a single misclick can derail the entire workflow
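The first challenge above (binary rewards give no signal for near-miss clicks) is easiest to see by comparing a hit-or-miss reward with a continuous one. The sketch below is illustrative only: it assumes an axis-aligned target box and a Gaussian centered on the box whose spread scales with the box size. The function names and the `sigma_scale` parameter are hypothetical and do not reproduce the exact reward of any cited method.

```python
import math


def binary_reward(click, box):
    """1.0 if the click lands inside the box (x1, y1, x2, y2), else 0.0."""
    x, y = click
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def gaussian_reward(click, box, sigma_scale=0.5):
    """Continuous reward that decays smoothly with distance from the box center.

    The standard deviation along each axis is sigma_scale times the box size
    on that axis, so small targets demand more precise clicks.
    """
    x, y = click
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


date_cell = (100, 100, 140, 120)   # hypothetical date-picker cell in pixels
near_miss = (145, 110)             # 5 px to the right of the cell
far_miss = (400, 400)

# Binary reward treats both misses identically; the Gaussian reward ranks
# the near miss above the far miss, giving the policy a direction to improve.
binary_near = binary_reward(near_miss, date_cell)
binary_far = binary_reward(far_miss, date_cell)
smooth_near = gaussian_reward(near_miss, date_cell)
smooth_far = gaussian_reward(far_miss, date_cell)
```

Scaling the spread by the element size is one design choice among several; a distance-based reward measured from the box edge would serve the same purpose.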
🧪 Running Example
Baseline: A supervised fine-tuned agent processes the screenshot and predicts clicks sequentially. It may click near the date picker but miss the exact target (binary reward gives no feedback on 'close' misses), select the wrong date because it overfits to training UI layouts, and fail to recover after navigating to a dead-end page.
Challenge: This task requires precise grounding (clicking small date cells and dropdown menus), long-horizon planning (search → filter → select → confirm → pay), adaptation to unseen airline website layouts, and verification that the booking actually succeeded rather than just appearing to.
Overall Progress
GUI agent research has undergone two major paradigm shifts in rapid succession. First, the move from text-based screen representations to direct visual perception (2023–2024) enabled zero-shot generalization across apps. Second, the shift from offline SFT to online RL with continuous rewards (2025) unlocked dramatic performance gains: small 7B models now routinely outperform 72B supervised baselines. The field is now entering a third phase focused on generalization, verification, and proactive agency.
Sub-topics
RL Reward Design for GUI Grounding
12 papers
Papers that replace binary hit-or-miss reward signals with continuous, distance-based, or density-aware rewards to improve spatial precision in GUI element grounding through reinforcement learning.
Online Environment RL Training
5 papers
Approaches that train GUI agents through live interaction with real or emulated environments using online reinforcement learning, replacing static offline datasets with dynamic exploration and self-play.
Visual Grounding Architectures
5 papers
Novel model architectures that improve how agents localize and identify GUI elements, including attention-based coordinate-free methods, high-resolution dual-branch encoders, and zoom-in refinement strategies.
Agent Architectures and Planning Frameworks
8 papers
Multi-agent systems, planning-execution-reflection pipelines, and autonomous exploration strategies that enable GUI agents to handle complex, long-horizon workflows across mobile and desktop environments.
Benchmarks, Evaluation, and Security
9 papers
Evaluation frameworks, knowledge benchmarks, automated auditing methods, and adversarial attacks targeting GUI agents, including proactive intent prediction and efficiency-based backdoor attacks.
💡 Key Insights
💡 Continuous spatial rewards outperform binary hit/miss by 20–25 points in GUI grounding
💡 Online RL in live environments yields 50%+ absolute gains over static supervised learning
💡 Small 7B RL-trained models routinely surpass 72B supervised baselines on grounding benchmarks
💡 Adversarial pop-ups achieve 86% attack success rate, exposing critical safety gaps in GUI agents
💡 Proactive intent prediction from passive screen observation defines the next frontier for GUI agents
π Timeline
Research has evolved from proving VLM feasibility for GUI tasks (2023) through architectural innovations for visual grounding (2024) to a dominant focus on reinforcement learning with sophisticated reward design (2025β2026), with emerging attention to safety, continual learning, and proactive agent behavior.
- (GPT-4V, 2023) demonstrated zero-shot GUI navigation using Set-of-Mark visual tagging, outperforming supervised Llama-2 by +24.6 points on AITW
- (CogAgent, 2023) introduced a dual-branch high-resolution cross-module for GUI understanding, achieving SOTA on AITW while reducing FLOPs by >50%
- (Mobile-Agent, 2024) built the first vision-centric autonomous mobile agent decoupling planning (GPT-4V) from localization (specialized visual tools)
- (DigiRL, 2024) pioneered online RL for device control with 64 parallel emulators, achieving 67.2% success on AITW β a +49.5% gain over SFT
π Transition from text-based screen parsing to direct visual perception of GUI screenshots using multimodal models.
- (MobileVLM, 2024) introduced graph-structured mobile pre-training with Mobile3M dataset, improving navigation by +34.2% over Qwen-VL-Max
- Pop-up attack study (Attacking Vision-Language Computer Agents via Pop-ups, 2024) revealed that adversarial pop-ups achieve 86% attack success rate against VLM agents
- (AgentTrek, 2024) demonstrated scalable trajectory synthesis from web tutorials at $0.55 per trajectory, improving WebArena success by +9.3%
- (InfiGUIAgent, 2025) proposed two-stage SFT with synthesized native reasoning including expectation-reflection loops for self-correction
- UI-R1 (UI-R1, 2025) showed that rule-based RL with just 136 samples can rival large-scale SFT models, gaining +22.1% on ScreenSpot
- GUI-G2 (GUI-G2, 2025) introduced Gaussian reward modeling that outperformed UI-TARS-72B by +24.7 points on ScreenSpot-Pro with a 7B model
- (GUI-Actor, 2025) eliminated coordinate prediction entirely via attention-based patch selection, outperforming UI-TARS-72B on ScreenSpot-Pro
- (MobileRL, 2025) achieved 80.2% on AndroidWorld with difficulty-adaptive GRPO, surpassing previous SOTA by +16 points
- (MiMo-VL, 2025) set new GUI grounding SOTA at 56.1 on OSWorld-G through mixed on-policy RL combining perception, grounding, and reasoning rewards
Shift from binary success/failure rewards to continuous spatial rewards and from offline SFT to online RL in live environments.
- (Agentic Reward Modeling, 2026) introduced agentic verification where reward models actively probe the environment, improving evaluation accuracy to 92.9%
- BEPA (From Off-Policy to On-Policy, 2026) bridged expert framework systems and end-to-end agents via bi-level assimilation, reaching 32.1% on OSWorld-Verified
- (OSExpert, 2026) introduced GUI-DFS exploration for autonomous skill discovery, tripling long-horizon task success and closing 80% of the human efficiency gap
- (PIRA-Bench, 2026) defined the proactive intent recommendation paradigm where agents anticipate user goals from passive screen observation
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Continuous Spatial Reward RL | Model GUI elements as spatial distributions (Gaussian or distance-based) so near-miss clicks receive partial reward proportional to proximity. | Improves on UI-TARS-72B by +24.7 percentage points on ScreenSpot-Pro, achieving 47.5% accuracy (GUI-G2). SE-RFT achieves 47.3% with only 3,018 training samples, outperforming UI-TARS-72B by +24.2%. | GUI-G2 (2025), UI-R1 (2025), Enhancing Visual Grounding for GUI... (2025), UI-AGILE (2025), InfiGUI-G1 (2025), LPO (2025) |
| Online Agentic RL with GRPO | Deploy agents in parallel live environments with difficulty-adaptive curriculum and trajectory-level advantage scoring for long-horizon sparse rewards. | MobileRL achieves 80.2% success on AndroidWorld, improving over previous SOTA (64.2%) by +16.0 percentage points. DigiRL achieves 67.2% on AITW, a +49.5% absolute gain over SFT (17.7%). | DigiRL (2024), MobileRL (2025), ZeroGUI (2025), From Off-Policy to On-Policy: Enhancing... (2026), GUI-Libra (2026) |
| Coordinate-Free Visual Grounding | Use attention heads or zoom-in refinement to localize elements from visual features directly, bypassing text-based coordinate token generation. | GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro, outperforming the 10× larger UI-TARS-72B (38.1) by +6.5 points. R-VLM improves grounding by +13% absolute over SeeClick across platforms. | CogAgent (2023), GUI-Actor (2025), R-VLM (2025), MiMo-VL (2025) |
| Agentic Verification and Pre-operative Critics | Empower verifier models with interactive capabilities to actively probe environment state rather than passively observing screenshots. | VAGEN improves evaluation accuracy from 84.7% (LLM-as-a-Judge) to 92.9% on OSWorld-Verified (+8.2%). GUI-Critic-R1 improves AndroidWorld success from 22.4% to 27.6% (+5.2%). | Agentic Reward Modeling (2026), Look Before You Leap: A... (2025), Guiding VLM Agents with Process... (2025) |
| Autonomous Data Synthesis for GUI Training | Replace expensive human-annotated trajectory collection with automated pipelines that convert web tutorials or random exploration into grounded training data. | AgentTrek achieves +9.3% task success on WebArena (22.4% vs 13.1% for GPT-4o) at $0.55/trajectory. GUI-Shift improves AndroidControl-High by +11.2% Exact Match over the base model. | AgentTrek (2024), GUI-Shift (2025), MobileGUI-RL (2025) |
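
To make the continuous spatial reward row above concrete, here is a minimal Python sketch contrasting a binary hit/miss reward with a Gaussian reward that decays with distance from the element center, so a near-miss click still earns partial credit. It is an illustration only, not the GUI-G2 implementation: the function names, the `sigma_scale` parameter, and the choice to tie the spread to the element's width and height are all assumptions.

```python
import math


def binary_click_reward(click, box):
    """1.0 if the click lands inside the target box, else 0.0."""
    x, y = click
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def gaussian_click_reward(click, box, sigma_scale=0.5):
    """Reward in (0, 1] that decays smoothly with distance from the box center.

    sigma_scale is a hypothetical hyperparameter: the per-axis spread is
    sigma_scale times the element's width/height, so larger elements
    tolerate larger offsets.
    """
    x, y = click
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


if __name__ == "__main__":
    target = (100, 200, 180, 240)  # (x1, y1, x2, y2) in pixels
    # Center hit, near miss just right of the element, and a far miss.
    for click in [(140, 220), (185, 220), (400, 400)]:
        print(click,
              binary_click_reward(click, target),
              round(gaussian_click_reward(click, target), 3))
```

The near-miss click gets zero under the binary reward but a clearly positive Gaussian reward, which is the denser learning signal the table row describes.
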
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| ScreenSpot-Pro | Accuracy (%) | 47.5% | GUI-G2 (2025) |
| AndroidWorld | Success Rate (%) | 80.2% | MobileRL (2025) |
| OSWorld-Verified | Success Rate (%) | 32.13% | From Off-Policy to On-Policy: Enhancing... (2026) |
| Android-in-the-Wild (AitW) | Success Rate (%) | 67.2% | DigiRL (2024) |
| OSWorld-G (GUI Grounding) | Accuracy Score | 56.1 | MiMo-VL (2025) |
⚠️ Known Limitations (4)
- Generalization degrades sharply from seen to unseen applications: RL gains drop from 26% on familiar instances to only 8% on new apps, suggesting overfitting to specific UI patterns (affects: Online Agentic RL with GRPO, Continuous Spatial Reward RL)
Potential fix: Few-shot test-time adaptation, domain randomization across diverse UI layouts, and curriculum-based training spanning multiple app categories
- Long-horizon desktop tasks remain largely unsolved with best success rates around 32%, as errors compound across dozens of sequential actions with no recovery mechanism (affects: Online Agentic RL with GRPO, Coordinate-Free Visual Grounding)
Potential fix: Hierarchical skill decomposition (as in OSExpert's GUI-DFS), process reward models for step-level correction, and modular planning with verified sub-goals
- GUI agents are highly vulnerable to adversarial visual attacks: simple pop-up injections derail task completion in 86% of cases, and basic prompt-based defenses are ineffective (affects: Coordinate-Free Visual Grounding, Online Agentic RL with GRPO)
Potential fix: Adversarial training with injected distractors, element provenance verification, and safety-constrained action spaces that block interactions with unverified UI elements
- Catastrophic forgetting when adapting to new apps: SFT overwrites old knowledge while RL struggles with sparse rewards in unfamiliar domains, requiring careful balancing of exploration and retention (affects: Continuous Spatial Reward RL, Online Agentic RL with GRPO)
Potential fix: Gradient surgery to project new learning onto conflict-free subspaces (CGL), entropy-guided SFT warmup with RL consolidation, and experience replay buffers
Major papers in this topic (10)
- MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents (2025-09) 9
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (2024-06) 9
- CogAgent: A Visual Language Model for GUI Agents (2023-12) 9
- GUI-G2: Gaussian Reward Modeling for GUI Grounding (2025-07) 8
- Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction (2026-01) 8
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2025-06) 8
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (2024-12) 8
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (2023-11) 8
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration (2026-03) 8
- MiMo-VL Technical Report (2025-06) 8
💡 Another cross-cutting theme examines Remote Sensing and Geospatial.
Remote Sensing and Geospatial
What: Research on adapting vision-language and foundation models to interpret satellite, aerial, and Earth observation imagery for tasks like classification, segmentation, object detection, and spatial reasoning.
Why: Earth observation data is critical for disaster response, environmental monitoring, urban planning, and agriculture, yet general-purpose AI models struggle with its unique overhead perspectives and domain-specific semantics.
Baseline: Standard vision-language models like CLIP or LLaVA, pretrained on internet-scale natural images, applied directly to remote sensing tasks via zero-shot transfer or basic fine-tuning.
- Severe scarcity of large-scale, annotated image-text datasets for satellite and aerial imagery domains
- Unique visual characteristics including bird's-eye perspectives, extreme scale variation, and tiny objects in massive pixel spaces
- Multi-modal heterogeneity across optical, SAR, infrared, and temporal data sources with alignment and fusion difficulties
🧪 Running Example
Baseline: A standard VLM (e.g., GPT-4V or LLaVA) struggles with overhead views of buildings, cannot reliably count damaged structures (R² = 0.10 for destruction counting), confuses rubble with construction sites, and lacks the temporal reasoning to compare pre- and post-disaster imagery.
Challenge: This example illustrates all three key challenges: (1) no large captioned disaster-imagery dataset exists for training, (2) damaged buildings appear as tiny irregular patches from orbit requiring fine-grained spatial reasoning, and (3) combining optical and SAR imagery (which can see through clouds during storms) requires multi-modal fusion.
Overall Progress
The field has progressed from lacking any large-scale RS image-text data (pre-2023) to having multiple million-scale datasets and billion-parameter foundation models. A major paradigm shift occurred around 2025 with reinforcement learning replacing supervised fine-tuning as the dominant adaptation strategy, enabling few-shot domain transfer. Concurrently, the field has moved from perception-only models toward agentic systems capable of multi-step spatial reasoning, tool use, and real-time drone navigation.
Sub-topics
Remote Sensing Vision-Language Model Adaptation
12 papers
Methods for adapting general-purpose VLMs to remote sensing domains, including novel training data pipelines, annotation-free alignment strategies, and parameter-efficient fine-tuning techniques that bridge the domain gap between internet imagery and Earth observation data.
Reinforcement Learning-Enhanced Reasoning for RS
8 papers
Applying reinforcement learning with verifiable rewards (RLVR) and group relative policy optimization (GRPO) to unlock and strengthen reasoning capabilities in remote sensing VLMs, especially in few-shot and resource-constrained settings.
Benchmarks, Datasets, and Evaluation
11 papers
Construction of large-scale datasets and comprehensive benchmarks that expose the gap between general VLM capabilities and geospatial domain requirements, spanning scene classification, counting, change detection, and cartographic reasoning.
Aerial and UAV Navigation and Tracking
7 papers
Leveraging VLMs for drone-based vision-and-language navigation, aerial object search, and multi-object tracking from UAV platforms, addressing challenges of real-time control, 3D spatial reasoning, and motion blur.
Multi-Modal Earth Observation and Foundation Models
15 papers
Large-scale self-supervised pretraining across multiple Earth observation modalities (optical, SAR, LiDAR, hyperspectral, climate data) and specialized applications including multi-modal fusion for detection, segmentation, hyperspectral unmixing, and geospatial intelligence.
💡 Key Insights
💡 One training example with RL rewards can match thousands of supervised annotations for satellite VLMs
💡 Ground-level internet photos effectively bridge the satellite-to-language annotation gap
💡 Mixture of Experts prevents interference between image, region, and pixel-level RS tasks
💡 Text-only domain knowledge cold-start dramatically improves subsequent visual RL performance
💡 Spatial grounding outperforms text-based action prediction for drone navigation by 65+ points
Timeline
Research has evolved from constructing foundational datasets and adapting pretrained VLMs (2023) through benchmarking gaps and multi-modal architectures (2024) to RL-driven reasoning, ultra-high-resolution understanding, and deployable agentic systems for aerial navigation and disaster response (2025–2026).
- (RS5M, 2023) constructed the first 5-million-pair RS image-text dataset using filtered web data and generated captions
- (SkyScript, 2023) mined 2.6 million pairs from OpenStreetMap with 29,000 semantic tags, two orders of magnitude richer than prior datasets
- (Ground Remote Alignment, 2023) demonstrated annotation-free VLM training by using ground photos as a semantic bridge, outperforming supervised VLMs by up to 20%
- (GeoChat, 2023) established the first grounded conversational RS VLM with task-specific tokens and 318k instruction pairs
- (SkySense, 2023) introduced the first billion-scale multi-modal RS foundation model with factorized spatiotemporal encoding, achieving SOTA on all 16 benchmarks
Shift from small, manually annotated RS datasets to million-scale automatically constructed image-text pairs using geographic metadata, enabling the first zero-shot VLMs for remote sensing.
- The GPT-4V Earth Observation benchmark (Good at captioning, bad at counting, 2024) revealed that frontier VLMs fail catastrophically on counting and change detection in satellite imagery
- (MMEarth, 2024) created a 1.2-million-location, 12-modality pretraining corpus and proposed Multi-Pretext MAE for geospatial representation learning
- (RSUniVLM, 2024) and (RS-MoE, 2024) introduced Mixture-of-Experts architectures for multi-granularity RS understanding
- (GEOBench-VLM, 2024) established a 31-task benchmark showing the best model achieves only 41.7% accuracy, highlighting the geospatial domain gap
- (SM3Det, 2024) unified multi-modal detection across RGB, SAR, and infrared with grid-level sparse MoE
- (Few-Shot, 2025) demonstrated that a single training example with binary rewards can match models trained on thousands of annotated samples
- (UAV-VL-R1, 2025) showed a 2B model outperforming the 36× larger Qwen2-VL-72B through multi-stage GRPO curriculum learning
- SPF (See, Point, Fly, 2025) achieved 93.9% drone navigation success without any training by reframing action as spatial grounding
- (Text Before Vision, 2026) established SOTA on XLRS-Bench by injecting text-only Earth-science knowledge before agentic visual RL
- (GeoReason, 2026) introduced Logical Consistency Reward to combat reasoning hallucinations in spatial decision-making
- (DeepEarth, 2026) achieved planetary-scale 4D modeling with 99.3% parameter reduction through learned hash encoding
Shift from supervised fine-tuning to reinforcement learning with verifiable rewards, enabling few-shot and even one-shot domain adaptation for remote sensing while unlocking structured reasoning capabilities.
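
As a rough illustration of what "verifiable rewards" with group-relative optimization can look like, the Python sketch below scores several sampled box predictions for one prompt by IoU against a ground-truth box, then standardizes the rewards within the group to obtain advantages. This is a simplified sketch under stated assumptions, not the training code of any paper listed here: using raw IoU as the reward, normalizing by the population standard deviation, and the names `iou` and `group_relative_advantages` are all illustrative choices.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    Samples better than the group mean get positive advantages, worse ones
    negative; eps (an assumed constant) avoids division by zero when all
    rewards are equal.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


if __name__ == "__main__":
    ground_truth = (10, 10, 50, 50)
    # Four hypothetical model outputs sampled for the same image and query.
    samples = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120), (12, 8, 48, 52)]
    rewards = [iou(s, ground_truth) for s in samples]
    for s, r, a in zip(samples, rewards, group_relative_advantages(rewards)):
        print(s, round(r, 3), round(a, 3))
```

Because the reward is computed by a rule rather than by a learned judge, even a single annotated example yields a usable signal for every sampled response, which is one plausible reading of why few-shot adaptation becomes feasible.
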
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Ground-Remote Vision-Language Alignment | Align satellite image encoders to CLIP's embedding space using co-located ground photos or OpenStreetMap tags as a semantic bridge, avoiding the need for direct satellite-text pairs. | Outperforms supervised RS VLMs by up to 20% on zero-shot classification (GRAFT) and achieves +6.2% average accuracy over baseline CLIP on seven benchmarks (SkyScript). | Remote Sensing Vision-Language Foundation Models... (2023), RS5M and GeoRSCLIP (2023), SkyScript (2023), A Recipe for Improving Remote... (2025), OSMDA (2026) |
| Reinforcement Learning with Verifiable Rewards for Remote Sensing | Use binary or IoU-based verifiable rewards with policy gradient optimization (GRPO) to fine-tune VLMs on remote sensing tasks, replacing thousands of annotated examples with minimal supervision. | 1-shot RLVR yields +11.65% on RSVQA-LR and +24.38% on DIOR-RS over the base model; Text-Before-Vision achieves 60.40% Pass@1 on XLRS-Bench, surpassing GPT-5.2 and Gemini 3.0 Pro. | Few-Shot (2025), SAMChat (2025), UAV-VL-R1 (2025), Text Before Vision (2026), GeoReason (2026) |
| Granularity-oriented Mixture of Experts | Route visual inputs to granularity-specific experts (image-level, region-level, pixel-level) using task-aware routers, preventing interference between different spatial reasoning requirements. | RSUniVLM achieves +29.7% accuracy on VRSBench-Ref visual grounding over GeoChat (69.31% vs. 39.6%) and 86.86% on SIRI-WHU scene classification versus GeoChat's 43.67%. | RSUniVLM (2024), RS-MoE (2024), SM3Det (2024) |
| Factorized Multi-Modal Foundation Pretraining | Factorize spatial, temporal, and modality dimensions into separate encodable components with multi-granularity contrastive learning, enabling flexible handling of varying input combinations. | SkySense surpasses Scale-MAE by +3.61% average across 16 datasets; MMEarth achieves +3.4% Top-1 accuracy over ImageNet baselines on land cover classification; DeepEarth achieves +35.0% R² improvement with 99.3% parameter reduction. | SkySense (2023), MMEarth (2024), Self-Supervised (2026), LEPA (2026) |
| Training-Free VLM Aerial Navigation | Convert VLM visual understanding into 3D waypoints by grounding target locations as pixels in the image and unprojecting them using camera geometry, decoupling slow reasoning from fast control. | SPF achieves 93.9% success rate versus PIVOT's 28.7% (+65 points); AirHunt improves success rate by 49.1% and reduces navigation error by 80.3% over baselines. | See, Point, Fly: A Learning-Free... (2025), AirHunt (2026), ViSA-Enhanced Aerial VLN (2026) |
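
The "ground as pixels, then unproject" step in the training-free navigation row can be sketched with the standard pinhole camera model. The Python snippet below is a minimal illustration, not the SPF pipeline: it assumes the target pixel and a depth estimate are already available, and the intrinsics `fx`, `fy`, `cx`, `cy`, the function name, and the camera-frame convention (x right, y down, z forward) are assumptions made for the example.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) at a given depth to a 3D point in the camera frame.

    Standard pinhole model: (cx, cy) is the principal point and (fx, fy)
    are focal lengths in pixels. depth is the distance along the optical
    axis, here a hypothetical value supplied by the caller.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


if __name__ == "__main__":
    # Hypothetical 640x480 camera with a 500-pixel focal length.
    intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    # Pixel the VLM pointed at, plus an assumed 5 m travel distance.
    waypoint = unproject_pixel(420, 240, 5.0, **intrinsics)
    print("camera-frame waypoint (m):", waypoint)
```

Keeping this geometric step outside the model is what allows slow VLM reasoning to be decoupled from a faster low-level controller that tracks the resulting waypoint.
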
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| VRSBench-Ref (Visual Grounding) | Accuracy (%) | 69.31% | RSUniVLM (2024) |
| XLRS-Bench (Ultra-High-Resolution RS Reasoning) | Pass@1 (%) | 60.40% | Text Before Vision (2026) |
| RSVQA-LR (Remote Sensing Visual Question Answering) | Accuracy (%) | +11.65% over base model with 1-shot training | Few-Shot (2025) |
| VisDrone (UAV Multi-Object Tracking) | MOTA (Multiple Object Tracking Accuracy) | SOTA on VisDrone dataset | MM-Tracker (2024) |
| DRL Simulator (Aerial Vision-Language Navigation) | Success Rate (%) | 93.9% (simulation), 92.7% (real-world) | See, Point, Fly: A Learning-Free... (2025) |
⚠️ Known Limitations (4)
- Cross-sensor generalization gap: models trained on optical imagery degrade significantly on SAR, infrared, and hyperspectral data due to fundamentally different imaging physics and appearance statistics. (affects: Ground-Remote Vision-Language Alignment, Reinforcement Learning with Verifiable Rewards for Remote Sensing, Granularity-oriented Mixture of Experts)
Potential fix: Multi-modal fusion frameworks like SkySense's factorized encoding and SM3Det's grid-level MoE can process multiple sensor types jointly, while dedicated SAR-optical alignment training may narrow the gap.
- Counting and fine-grained quantification failure: even frontier VLMs consistently fail at counting objects in aerial imagery (R² < 0.20), especially as density increases beyond 50 objects per scene. (affects: Ground-Remote Vision-Language Alignment, Granularity-oriented Mixture of Experts)
Potential fix: Dedicated counting heads or density estimation modules, combined with higher-resolution processing and region-level expert routing, could improve quantitative spatial reasoning.
- Ultra-high-resolution processing bottleneck: satellite images can be tens of thousands of pixels, but VLMs are limited to small input sizes, requiring complex tiling or zoom-in strategies that increase latency. (affects: Reinforcement Learning with Verifiable Rewards for Remote Sensing, Factorized Multi-Modal Foundation Pretraining)
Potential fix: Agentic zoom-in tools (as in Text-Before-Vision), position encoding interpolation (as in GeoChat), and hierarchical patch processing can enable efficient UHR reasoning.
- Logical hallucinations in spatial reasoning: models produce correct answers from flawed reasoning chains or rely on positional shortcuts, undermining reliability for strategic applications. (affects: Reinforcement Learning with Verifiable Rewards for Remote Sensing, Training-Free VLM Aerial Navigation)
Potential fix: Logical Consistency Rewards that penalize reasoning drift under option permutation (GeoReason) and explicit multi-phase verification pipelines (ViSA) can enforce grounded spatial logic.
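
To give a feel for why the ultra-high-resolution bottleneck above translates into latency, the following Python sketch enumerates overlapping tiles for a large image. It is a generic sliding-window scheme, not a method from any cited paper; the 1024-pixel tile size, 128-pixel overlap, and the function name `tile_boxes` are assumptions chosen for illustration.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (x1, y1, x2, y2) boxes that cover a width x height image.

    Tiles advance by (tile - overlap) pixels; the last tile on each axis is
    shifted so it sits flush with the image edge instead of running past it.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    boxes = tile_boxes(20000, 20000)
    # Each tile would need its own VLM forward pass in a naive pipeline.
    print("tiles for a 20000 x 20000 image:", len(boxes))
```

Every tile implies a separate model call in a naive pipeline, which is why zoom-in tools that inspect only a few regions are attractive for images of this size.
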
Major papers in this topic (10)
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery (2023-12) 9
- See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation (2025-09) 9
- Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards (2025-07) 8
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment (2023-12) 8
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts (2024-12) 8
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding (2026-02) 8
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing (2023-12) 8
- DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response (2025-05) 8
- Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding (2026-03) 8
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing (2023-11) 8
💡 Another cross-cutting theme examines Audio and Speech Integration.
Audio and Speech Integration
What: Research on integrating audio and speech signals into multimodal AI systems for joint understanding, reasoning, and generation across audio-visual-text modalities.
Why: Effective human-AI interaction requires understanding not just text and images, but also speech content, environmental sounds, and their complex relationships to visual context.
Baseline: Cascaded pipelines that first convert speech to text via ASR then feed transcripts to text-only language models, losing paralinguistic cues and environmental audio information.
- Cross-modal alignment between continuous audio signals and discrete text tokens causes information loss and modality conflicts during training
- Complex audio reasoning requiring temporal ordering, counting, and causal inference remains far below human-level performance
- Joint generation of temporally synchronized audio-visual content demands precise local alignment beyond global semantic matching
🧪 Running Example
Baseline: A cascaded ASR-plus-LLM pipeline would transcribe the chef's speech but lose vocal tone (confidence vs. hesitation), miss the sizzling sound entirely as non-speech audio, and lack temporal grounding to pinpoint when specific sound events occur.
Challenge: This example requires simultaneous speech understanding (recipe instructions), paralinguistic analysis (vocal confidence), environmental sound reasoning (sizzling timing), and temporal grounding, all integrated with visual context of the cooking process.
Overall Progress
The field has undergone two major paradigm shifts in three years: first, from siloed audio processing to natively multimodal joint training (led by Gemini and AnyMAL in 2023), and second, from supervised learning to RL-enhanced reasoning (led by SARI and Omni-R1 in 2025). Concurrently, generation capabilities evolved from simple audio-gesture pairing to full cinematic audio-visual production at scale (Movie Gen, Seedance 1.5). The emergence of comprehensive benchmarks (MMAU, AHELM) has been instrumental in revealing the persistent gap between AI and human audio reasoning, which in turn accelerated the adoption of RL methods.
Sub-topics
Audio-Language Understanding & Reasoning
14 papers
Models that perceive and reason about audio signals (including speech, environmental sounds, and music) using language model backbones. This sub-topic covers architectures for audio comprehension and emerging RL-based methods for structured audio reasoning.
Omni-Modal Speech-Vision-Text Models
16 papers
Unified large language models that natively integrate speech and audio alongside vision and text, enabling end-to-end multimodal understanding and interaction without relying on external ASR or TTS systems.
Co-Speech Gesture & Animation Synthesis
11 papers
Generating realistic body gestures, facial animations, and full-body motion synchronized with speech audio, including diffusion-based approaches for stylized and semantically meaningful gesture generation.
Audio-Visual Content Generation & Synchronization
14 papers
Joint generation of synchronized audio and video content, including video-to-audio synthesis, music generation from visual inputs, and end-to-end movie production with coherent soundtracks.
Multimodal Emotion & Affect Recognition
8 papers
Detecting and reasoning about emotions by fusing audio (vocal tone, prosody), visual (facial expressions, gestures), and textual cues, including clinical applications like depression screening.
Audio-Visual Safety, Security & Benchmarking
16 papers
Evaluation frameworks for audio-language models, adversarial attacks exploiting audio modalities, deepfake detection, content moderation, and watermarking for joint audio-visual content.
💡 Key Insights
💡 GRPO-based RL training pushed audio reasoning from 53% to 71% on MMAU within one year
💡 Text-only RL fine-tuning surprisingly improves audio QA nearly as much as audio-based training
💡 Diffusion models halved gesture generation error compared to GAN-based approaches
💡 Native multimodal joint training outperforms modular cascaded pipeline approaches
💡 Audio-language models suffer catastrophic >98% accuracy drops under adversarial text inputs
Timeline
Research has rapidly converged on three fronts: (1) eliminating cascaded pipelines in favor of end-to-end omni-modal models, (2) applying reinforcement learning to unlock multi-step audio reasoning, and (3) scaling joint audio-visual generation from short clips to feature-length content with precise synchronization.
- DiffGesture (Taming Diffusion Models for Audio-Driven..., 2023) introduced the first diffusion-based approach for gesture generation, achieving state-of-the-art FGD of 1.506
- LTU-AS (Joint Audio and Speech Understanding, 2023) pioneered dual-path audio perception combining speech recognition with environmental sound understanding in a single LLM
- (Gemini, 2023) demonstrated native multimodal joint training across audio, vision, and text, exceeding human-expert MMLU performance
- (NExT-GPT, 2023) introduced the first end-to-end any-to-any MM-LLM connecting frozen encoders and diffusion decoders
- (Any-Modality, 2023) demonstrated scalable multimodal alignment with a frozen 70B LLM using quantized pre-training
Shift from single-modality audio models and GAN-based gesture generation to LLM-integrated audio understanding and diffusion-based motion synthesis.
- (Movie Gen, 2024) scaled video generation to 30B parameters with synchronized 48kHz audio, setting new industry benchmarks
- (MMAU, 2024) established the first expert-level audio reasoning benchmark, revealing that the best model (Gemini Pro 1.5) achieves only 52.97% vs. 81.85% human accuracy
- (Video-MME, 2024) created the first full-spectrum video benchmark showing audio/subtitles boost performance by 4-6% on longer videos
- (Emotion-LLaMA, 2024) achieved top rank on the EMER challenge by aligning audio and multi-view visual encoders into the LLaMA embedding space
- (Media2Face, 2024) created a trilogy of facial asset, 60-hour dataset, and latent diffusion model achieving 10.44mm Lip Vertex Error, outperforming EmoTalk by 28.5%
Emergence of comprehensive audio benchmarks (MMAU, Video-MME) revealing that even top models achieve only ~53% on expert-level audio reasoning, catalyzing a push toward deeper reasoning capabilities.
- (SARI, 2025) extended GRPO to audio with structured CoT and curriculum learning, achieving 67.08% on MMAU
- (Omni-R1, 2025) achieved 71.3% MMAU SOTA and discovered that text-only RL fine-tuning yields comparable audio QA improvements
- (Audio Flamingo Sound-CoT, 2025) achieved 79.83% on MMAU-Sound, surpassing GPT-4o Audio by +16.63 percentage points via chain-of-thought reasoning
- (AHELM, 2025) introduced the first standardized evaluation covering 10 aspects including fairness, safety, and bias across 14 ALMs
- mAVE (mAVE: A Watermark for Joint..., 2026) introduced cryptographic binding of audio-video watermarks to prevent swap attacks, achieving >99% binding integrity
RL-based training (GRPO) emerges as the dominant paradigm for audio reasoning, with multiple independent groups applying it to push MMAU performance from ~53% to over 71% in under a year.
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Omni-Modal Native Training | Ingest raw audio signals (at 16kHz) jointly with vision and text during pre-training rather than grafting audio encoders onto text-only models. | Gemini Ultra surpasses human-expert performance on MMLU with 90.04% (vs. 89.8% human baseline), achieving SOTA on 30 of 32 benchmarks. VITA-1.5 bridges the gap between open-source models and GPT-4o by eliminating separate ASR/TTS modules. | Gemini (2023), VITA-1.5 (2025), Lyra (2024), Any-Modality (2023), Gemini 2.5 (2025) |
| RL-Enhanced Audio Reasoning | Use GRPO to reward models for both correct answers and coherent reasoning chains, with curriculum learning ordering samples from easy to hard. | Omni-R1 achieves 71.3% on MMAU Test-mini, improving over base Qwen2.5-Omni by +5.4% absolute (65.9% → 71.3%). SARI achieves 67.08% on MMAU test-mini, +16.35% over Qwen2-Audio-7B-Instruct baseline. | SARI (2025), Omni-R1 (2025), Audio-Thinker (2025), Audio Flamingo Sound-CoT Technical Report:... (2025), EchoInk-R1 (2025) |
| Diffusion-based Co-Speech Motion Synthesis | Model gesture generation as a conditional diffusion process over skeleton or mesh sequences, with cross-modal attention for speech-gesture synchronization. | DiffGesture achieves FGD (Fréchet Gesture Distance) of 1.506 on TED Gesture, halving the previous best HA2G score of 3.072. Media2Face achieves 10.44mm Lip Vertex Error, outperforming EmoTalk (14.61mm) by 28.5%. | Taming Diffusion Models for Audio-Driven... (2023), EMAGE (2023), Media2Face (2024), EchoMimicV3 (2025) |
| Joint Audio-Visual Diffusion Generation | Process video and audio streams in parallel within a unified diffusion backbone using cross-modal attention to enforce temporal lock-step synchronization. | MM-LDM outperforms MM-Diffusion by 114.6 FVD on AIST++ with 10x faster sampling speed. Movie Gen scales to 30B parameters for 1080p HD video with synchronized 48kHz audio, surpassing Runway Gen3 and OpenAI Sora. | Movie Gen (2024), Seedance 1.5 pro (2025), MM-LDM (2024), ThinkSound (2025), V2M-Zero (2026) |
| Dual-Path Audio-Language Architectures | Combine discrete speech tokens from an ASR decoder with continuous audio features from encoder layers to capture both what is said and how it sounds. | GAMA outperforms prior LALMs (LTU, SALMONN, Pengi) by 1-84% across diverse audio tasks. Sound-CoT achieves 79.83% on MMAU-Sound, surpassing GPT-4o Audio at 63.20% by +16.63 percentage points. | Joint Audio and Speech Understanding (2023), GAMA (2024), Audio Flamingo 2 (2025), MoE-Adapter (2026) |
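
The RL-enhanced audio reasoning row combines two ideas that are easy to sketch: a reward that credits both a well-formed reasoning trace and a correct answer, and a curriculum that presents easy samples before hard ones. The Python below is a hedged illustration rather than the SARI or Omni-R1 recipe: the `<think>`/`<answer>` tag format, the `format_weight` value, the use of a format check as a stand-in for "coherent reasoning", and the difficulty scores are all assumptions.

```python
import re

# Assumed response template: reasoning inside <think>, final choice inside <answer>.
THINK_ANSWER = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def structured_reward(response, gold, format_weight=0.2):
    """Reward = format credit + correctness credit.

    Returns 0.0 if the response does not follow the assumed template,
    format_weight if it is well formed but wrong, and 1.0 if it is well
    formed and the answer matches gold (case-insensitive).
    """
    match = THINK_ANSWER.match(response.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip().lower()
    correct = 1.0 if answer == gold.strip().lower() else 0.0
    return format_weight + (1.0 - format_weight) * correct


def curriculum_order(samples, difficulty):
    """Sort training samples from easy to hard using a difficulty function."""
    return sorted(samples, key=difficulty)


if __name__ == "__main__":
    response = "<think>The sizzling begins right after the pour.</think><answer>B</answer>"
    print(structured_reward(response, "B"))
    # Hypothetical per-sample difficulty, e.g. a base-model error rate.
    scores = {"count_events": 0.8, "detect_speech": 0.1, "order_events": 0.5}
    print(curriculum_order(list(scores), scores.get))
```

A real system would likely judge reasoning quality with more than a template check; the point here is only how the two reward terms and the easy-to-hard ordering fit together.
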
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MMAU (Massive Multi-Task Audio Understanding) | Accuracy (multiple-choice) | 71.3% on Test-mini | Omni-R1 (2025) |
| MMAU-Sound | Accuracy (multiple-choice) | 79.83% | Audio Flamingo Sound-CoT Technical Report:... (2025) |
| Video-MME | Accuracy (multiple-choice) | 81.3% (with subtitles) | Video-MME (2024) |
| TED Gesture (FGD) | Fréchet Gesture Distance (FGD, lower is better) | 1.506 FGD | Taming Diffusion Models for Audio-Driven... (2023) |
| MMLU (Massive Multitask Language Understanding) | Accuracy | 90.04% | Gemini (2023) |
⚠️ Known Limitations (4)
- Severe textual bias in audio-language models: when text and audio conflict, models overwhelmingly trust text, with accuracy dropping from 87.8% to 1.7% under adversarial conditions, undermining reliability in real-world scenarios where modalities may disagree. (affects: Dual-Path Audio-Language Architectures, Omni-Modal Native Training)
Potential fix: MATA proposes training-free attention amplification for audio tokens, while MCR-Bench shows supervised fine-tuning on conflict-rich data can recover adversarial accuracy from 1.5% to 54.3%.
- Persistent gap between AI and human audio reasoning: even the best models achieve 71.3% vs. 81.85% human accuracy on expert-level audio tasks, with cross-recording speaker identification remaining at chance level (<50%), indicating fundamental limitations in audio-language alignment. (affects: RL-Enhanced Audio Reasoning, Dual-Path Audio-Language Architectures)
Potential fix: Curriculum-guided RL (SARI) and synthetic reasoning data (AudioSkills in Audio Flamingo 2) show promise, but cross-recording reasoning and long-audio understanding remain open challenges.
- Audio-visual security vulnerabilities: adversarial perturbations can inject hidden instructions into audio/images that steer model behavior while remaining imperceptible to humans, and voice-based jailbreak attacks achieve 77.8% success rate against GPT-4o's safety guardrails. (affects: Omni-Modal Native Training, Dual-Path Audio-Language Architectures)
Potential fix: Cryptographic audio-visual binding (mAVE) addresses watermarking integrity, and generator-internal probing (X-AVDT) improves deepfake detection by +13.1% accuracy, but defense against voice jailbreaks remains largely unsolved.
- Computational cost and scalability barriers: state-of-the-art generation models require up to 30B parameters (Movie Gen) and massive compute, while long-audio processing beyond 5 minutes remains challenging for most architectures. (affects: Joint Audio-Visual Diffusion Generation, Diffusion-based Co-Speech Motion Synthesis)
Potential fix: EchoMimicV3 demonstrates competitive performance with only 1.3B parameters via soup-of-tasks paradigm, Phi-4-Multimodal uses Mixture of LoRAs to keep the base model frozen, and MambaDance replaces quadratic attention with linear-time state space models.
Major papers in this topic (10)
- Gemini: A Family of Highly Capable Multimodal Models (2023-12) 10
- Movie Gen: A Cast of Media Foundation Models (2024-10) 9
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024-10) 9
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
- Phi-4-Mini and Phi-4-Multimodal (2025-04) 9
- AHELM: A Holistic Evaluation of Audio-Language Models (2025-09) 9
- mAVE: A Watermark for Joint Audio-Visual Generation Models (2026-03) 9
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities (2025-07) 9
- SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025-04) 8
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation (2023-03) 8
Another cross-cutting theme examines Medical and Healthcare.
Medical and Healthcare
What: Research on adapting and developing multimodal AI models that integrate medical images, clinical text, and structured data for diagnosis, report generation, and clinical decision support.
Why: Clinicians must integrate heterogeneous data across imaging modalities, patient history, and lab results, yet current AI systems are fragmented and lack clinical reasoning transparency.
Baseline: Standard approaches use single-modality supervised models trained on task-specific labeled datasets, or adapt general-purpose CLIP-style VLMs via supervised fine-tuning on medical image-text pairs.
- Medical data scarcity and privacy constraints limit large-scale multimodal training datasets
- Domain gap between natural images and medical images causes poor transfer of general VLMs
- Models frequently hallucinate clinical findings not supported by visual evidence
- Missing modalities at inference time due to heterogeneous clinical data availability
🧪 Running Example
Baseline: A standard VLM fine-tuned via SFT generates a plausible-sounding report but hallucinates findings not present in the image (e.g., fabricating 'pleural effusion'), fails to reference the prior study, and provides no reasoning for its conclusions.
Challenge: This example illustrates multiple key challenges: the model must perceive subtle visual abnormalities (domain gap), avoid hallucinating non-existent findings (factual accuracy), integrate longitudinal data (missing modality handling), and provide transparent reasoning steps (interpretability).
Overall Progress
Medical multimodal AI has undergone three paradigm shifts in three years: from task-specific models to universal foundation models (2023), from 2D to native 3D volumetric understanding (2024), and from supervised fine-tuning to reinforcement-learning-driven reasoning (2025). The field is now converging toward deployable, transparent, multi-agent systems that combine specialist tools with verifiable reasoning chains.
Sub-topics
Reinforcement Learning for Medical Reasoning
18 papers
Applying Group Relative Policy Optimization (GRPO) and Reinforcement Learning with Verifiable Rewards (RLVR) to medical VLMs, enabling emergent chain-of-thought reasoning without expensive expert annotations. This paradigm replaces supervised fine-tuning with reward-driven self-improvement.
Medical Vision-Language Foundation Models
35 papers
Large-scale foundation models pretrained on medical image-text data to serve as universal backbones for diverse clinical tasks including classification, segmentation, report generation, and visual question answering across 2D and 3D modalities.
Medical Multi-Agent and Agentic Systems
12 papers
LLM-orchestrated agent frameworks that coordinate specialized medical tools, enable multi-step clinical reasoning, and replicate collaborative diagnostic workflows through role-specialized agents and tool-augmented inference.
Robust Multi-Modal Clinical Data Fusion
28 papers
Methods for integrating heterogeneous clinical data (imaging, EHR, genomics, physiological signals) while handling missing modalities, imbalanced data, and modal inconsistencies common in real-world healthcare settings.
Medical Report Generation and Clinical Reasoning
22 papers
Automated generation of radiology and clinical reports with a focus on factual accuracy, interpretable reasoning chains, and adaptation to diverse clinical scenarios. Includes structured Chain-of-Thought approaches and retrieval-augmented methods.
Medical AI Benchmarks and Evaluation
15 papers
Standardized evaluation frameworks, benchmark datasets, and systematic analyses for assessing the quality, safety, robustness, and clinical relevance of medical multimodal AI systems.
Key Insights
- Reinforcement learning with verifiable rewards outperforms supervised fine-tuning for medical reasoning
- Small RL-trained models (2B) can surpass 72B supervised models on medical tasks
- Multi-agent medical systems now match proprietary frontier models at 25x fewer parameters
- Structured visual Chain-of-Thought with expert grounding reduces hallucinations dramatically
- Missing modality robustness through shared-specific decomposition enables real-world deployment
- 3D-native medical VLMs significantly outperform 2D-to-3D lifting approaches on volumetric tasks
Timeline
Research has evolved from adapting general-domain models (CLIP, SAM) to building purpose-built medical foundation models, and most recently to training these models via reinforcement learning to develop emergent clinical reasoning capabilities, all while increasing emphasis on transparency, safety, and clinical deployability.
- MedSAM (Segment Anything in Medical Images, 2023) adapted the Segment Anything Model to 1.5M medical image-mask pairs, reducing annotation time by 82%
- ShaSpec (Shared-Specific, 2023) introduced shared-specific decomposition for handling missing modalities in both segmentation and classification
- SleepFM (Multi-modal Representation Learning for Sleep, 2024) pioneered leave-one-out contrastive learning across brain, cardiac, and respiratory signals for sleep analysis
- Video pretraining study (Video Pretraining Advances 3D Deep Learning, 2023) demonstrated that natural video pretraining transfers effectively to 3D medical CT tasks
MedSAM demonstrated that a single foundation model trained on diverse medical data could outperform task-specific models, establishing the universal medical AI paradigm.
- Merlin (CT Vision-Language Foundation Model, 2024), published in Nature, introduced a 3D-native VLM for CT and achieved +16% F1 in zero-shot findings classification
- M3D (3D Medical Image Analysis with MLLMs, 2024) created the largest 3D medical dataset with 120K image-text pairs and 662K instruction pairs
- MMedAgent (Learning to Use Medical Tools, 2024) introduced the first multi-modal medical agent framework with six specialized tools
- PRISM (Multi-modal Generative Foundation Model for..., 2024) adapted vision-language pretraining to gigapixel whole slide images using GPT-4 report summarization
- PaliGemma 2 (Versatile VLMs for Transfer, 2024) achieved SOTA on radiology report generation and demonstrated that general VLMs can replace specialized medical architectures
Research shifted from adapting 2D natural-image models to building native 3D medical VLMs (Merlin, M3D) that process volumetric data directly.
- MedVLM-R1 (Medical VLM via Reinforcement Learning, 2025) demonstrated emergent medical reasoning via GRPO, boosting accuracy from 55% to 78% without reasoning annotations
- OraPO (Oracle-educated GRPO, 2025) achieved SOTA radiology report generation with only 1K training samples by introducing FactScore rewards
- QoQ-Med (Domain-Aware, 2025) balanced training across clinical specialties with domain-aware policy optimization, boosting rare-domain F1 by 43%
- MedXpertQA (Expert-Level, 2025) established a rigorous benchmark where even o1 achieves only 49.89% accuracy
- S-Chain (Structured Visual Chain-of-Thought, 2025) created the first large-scale expert-annotated visually grounded reasoning dataset with 12K images
- Hulu-Med (Transparent Generalist Medical VLM, 2025) unified 2D/3D/video understanding in one architecture, surpassing GPT-4o on 16 of 30 benchmarks
MedVLM-R1 triggered a paradigm shift from supervised fine-tuning to reinforcement learning for medical VLMs, demonstrating that small models (2B) with RL can outperform 72B supervised models.
- Meissa (Multi-modal Medical Agentic Intelligence, 2026) distilled agentic behaviors into a 4B-parameter model matching proprietary frontier agents with 22x lower latency
- MedMASLab (Unified Framework for Medical Multi-Agent Systems, 2026) standardized multi-agent medical AI evaluation across 11 architectures with semantic verification
- PathFinder (Multi-Agent, 2025) surpassed human pathologists on melanoma grading using four collaborative specialized agents
- LoV3D (Longitudinal 3D Brain MRI Reasoning, 2026) achieved 93.7% dementia classification with verifiable structured JSON outputs and automated DPO
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| RL-based Medical Reasoning | Train medical VLMs via reward signals (format + accuracy) rather than imitation, enabling self-discovered reasoning without Chain-of-Thought labels. | Improves on SFT baselines by +23.1% average accuracy (MedVLM-R1: 78.22% vs 55.11% base), and OraPO achieves SOTA F1 of 0.357 on MIMIC-CXR using only 1K training samples vs 1.27M for prior methods. | MedVLM-R1 (2025), OraPO (2025), QoQ-Med (2025), MedEyes (2025), MedVLThinker (2025) |
| Medical Vision-Language Foundation Models | Pretrain unified architectures on millions of medical image-text pairs with domain-specific encoders, enabling zero-shot and few-shot transfer across clinical specialties. | MedGemma improves +15.5–18.1% on out-of-distribution CXR classification over base Gemma; Merlin achieves +16.0% F1 in zero-shot findings classification vs supervised baselines; MedSAM outperforms U-Net by 15.5% on unseen segmentation tasks. | Segment Anything in Medical Images (2023), Merlin (2024), MedGemma (2025), Hulu-Med (2025), M3D (2024) |
| Medical Multi-Agent Diagnostic Systems | Decompose medical diagnosis into role-specialized agents (triage, imaging, synthesis) that collaborate via structured communication, replacing monolithic end-to-end models. | Meissa matches proprietary frontier agents (GPT-4o) in 10 of 16 settings with 25x fewer parameters and 22x lower latency; MedAgent-Pro outperforms GPT-4o by 34% on glaucoma diagnosis; PathFinder surpasses human pathologists by 9% accuracy. | MMedAgent (2024), MedRAX (2025), Meissa (2026), PathFinder (2025), MedMASLab (2026) |
| Robust Multi-Modal Fusion with Missing Modalities | Decompose representations into shared (cross-modal) and specific (modality-unique) components, enabling graceful degradation when modalities are missing. | ShaSpec improves brain tumor segmentation Dice by >3–5% over prior methods on BraTS2018; CLoE achieves 88.09% Dice on Whole Tumor vs 87.54% best baseline; ACADiff maintains 89.4% diagnostic accuracy with 20% missing data. | Multi-modal Learning with Missing Modality... (2023), CLoE (2026), ACADiff (2026), DrFuse (2024) |
| Structured Visual Chain-of-Thought Reasoning | Structure medical reasoning as a multi-step cognitive process where each stage is visually grounded and independently verifiable, mimicking expert diagnostic workflows. | S-Chain supervision improves accuracy by +11.09% over base training and +4.47% over synthetic GPT-4.1 supervision; V2T-CoT achieves +5.11% on SLAKE over LLaVA-Med; ChestX-Reasoner improves reasoning by +18% over base model. | S-Chain (2025), ChestX-Reasoner (2025), Think Twice to See More:... (2025), Thinking with Gaze (2026) |
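
The "format + accuracy" reward and the group-relative normalization behind GRPO-style training can be illustrated with a minimal sketch. It is not taken from MedVLM-R1 or any other paper listed here: the `<think>`/`<answer>` tag template, the function names, and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
import statistics

# Hypothetical output template: reasoning in <think>, final choice in <answer>.
TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", flags=re.DOTALL
)


def reward(completion: str, gold: str) -> float:
    """Format reward (1.0 if the template is followed) plus accuracy reward (1.0 if correct)."""
    match = TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0  # malformed output earns neither reward
    is_correct = match.group(1).strip().lower() == gold.strip().lower()
    return 1.0 + (1.0 if is_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled answers to one multiple-choice question whose gold answer is "B".
completions = [
    "<think>Opacity in the left lower lobe.</think><answer>B</answer>",
    "<think>Lungs look clear.</think><answer>A</answer>",
    "B",  # correct letter, but the required format is missing
]
rewards = [reward(c, "B") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, no learned value model or step-level reasoning label is needed; only a checkable final answer is required, which is what makes this setup attractive when expert reasoning annotations are scarce.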
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MIMIC-CXR (Radiology Report Generation) | RadGraph F1 | 0.357 F1 | OraPO (2025) |
| MedXpertQA MM (Expert-Level Medical Reasoning) | Accuracy | GPT-5 achieves super-human performance | Capabilities of GPT-5 on Multimodal... (2025) |
| VQA-RAD (Medical Visual Question Answering) | Accuracy | 83.20% | MMed-RAG (2024) |
| BraTS 2020/2021 (Brain Tumor Segmentation) | Dice Score | 0.912 average Dice | Modality-Aware (2024) |
| SLAKE (Medical VQA) | Accuracy | 86.3% GPT Score | Can Generalist Vision Language Models... (2025) |
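
For readers unfamiliar with the Dice score used in the BraTS row, the following sketch computes it for flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions; benchmark toolkits may handle that edge case differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * intersection / total


# Toy 4-voxel example: the prediction marks one extra voxel as tumor.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(round(dice_score(predicted, reference), 3))  # 0.667
```

An average Dice such as the 0.912 reported above is typically a mean of this quantity over cases and tumor sub-regions.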
⚠️ Known Limitations (4)
- Medical hallucinations remain pervasive: models fabricate findings, misidentify anatomy, or omit critical pathologies, with standard metrics failing to capture clinical danger levels (affects: Medical Vision-Language Foundation Models, RL-based Medical Reasoning (GRPO/RLVR), Structured Visual Chain-of-Thought Reasoning)
Potential fix: Structured visual grounding (S-Chain), FactScore-based rewards (OraPO), and concept bottleneck models that force intermediate clinical fact verification before report generation
- Data scarcity and privacy constraints severely limit large-scale medical multimodal training, with most institutions holding fragmented, single-modality datasets that cannot be easily shared (affects: Medical Vision-Language Foundation Models, Robust Multi-Modal Fusion with Missing Modalities)
Potential fix: Federated learning with pseudo-modality generation (Fed-PMG), synthetic data generation from textbook knowledge (MM-Skin, MM-Retinal), and data-efficient RL training (OraPO achieves SOTA with 1K samples)
- Evaluation fragmentation: benchmarks use inconsistent metrics, datasets, and prompting strategies, making fair comparison across methods nearly impossible and enabling overfitting to specific test sets (affects: Medical Multi-Agent Diagnostic Systems, Medical Vision-Language Foundation Models)
Potential fix: Unified evaluation toolkits (MultiMedEval), semantic VLM-based judges replacing brittle text matching (MedMASLab), and expert-curated difficult benchmarks with rigorous filtering (MedXpertQA)
- Generalist-specialist trade-off: specialized medical VLMs excel in-distribution but fail on out-of-distribution modalities, while generalists lack clinical depth but generalize better after fine-tuning (affects: Medical Vision-Language Foundation Models, RL-based Medical Reasoning (GRPO/RLVR))
Potential fix: Lightweight domain adaptation via LoRA and prompt learning (GDPL), domain-aware RL that balances across specialties (QoQ-Med DRPO), and modular agent systems that dynamically select specialist tools (Meissa, MedRAX)
Major papers in this topic (10)
- Segment Anything in Medical Images (2023-04) 9
- Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset (2024-06) 9
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation (2025-09) 9
- QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training (2025-05) 9
- Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding (2025-10) 9
- S-Chain: Structured Visual Chain-of-Thought for Medicine (2025-10) 9
- Meissa: Multi-modal Medical Agentic Intelligence (2026-03) 9
- MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems (2026-03) 9
- Capabilities of GPT-5 on Multimodal Medical Reasoning (2025-08) 9
- LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments (2026-03) 9
Another cross-cutting theme examines Safety and Robustness.
Safety and Robustness
What: Research on making multimodal models (especially Vision-Language Models) resistant to adversarial attacks, jailbreaks, hallucinations, data poisoning, and failures under distribution shifts.
Why: As VLMs are deployed in safety-critical applications like autonomous driving and healthcare, ensuring they cannot be manipulated or produce harmful content is essential.
Baseline: Standard VLMs inherit text-based safety alignment from LLMs but introduce new vulnerabilities through the visual modality, which bypasses existing safeguards.
- Visual inputs create a continuous attack surface that bypasses text-based safety filters and alignment
- Models hallucinate objects or fabricate information not grounded in visual evidence
- Safety fine-tuning often relies on spurious textual correlations rather than true understanding of harm
- Embodied agents face compounding errors where a single unsafe action can cause irreversible physical consequences
🧪 Running Example
Baseline: A standard VLM may provide dangerous chemical mixing instructions because its text-only safety alignment does not recognize the visual context as hazardous, or it may refuse all chemistry-related queries regardless of intent.
Challenge: This example illustrates the core challenges: (1) the image bypasses text safety filters since the text alone seems innocent, (2) the model must reason about cross-modal harm where text+image together are dangerous, and (3) over-refusal occurs if the model blocks all chemistry questions including safe ones.
Overall Progress
The field has progressed from discovering that visual inputs bypass text safety alignment (2023) to developing sophisticated training-time and inference-time defenses (2024-2025), and now focuses on structured safety reasoning and consequence-aware policies (2025-2026). A critical paradigm shift occurred from binary safe/unsafe classification to structured reasoning chains that make safety decisions auditable and explainable. The arms race between attacks and defenses continues to intensify, with each side driving innovation in the other.
Sub-topics
Jailbreak Attacks on VLMs
35 papers
Methods that exploit the visual modality to bypass safety alignment in Vision-Language Models, including adversarial image perturbations, typography-based attacks, and multi-modal prompt injection.
Safety Alignment and Training
30 papers
Training-time methods to align VLMs with safety requirements, including safety fine-tuning datasets, adversarial DPO, decoupled preference optimization, and reinforcement learning with safety constraints.
Inference-Time Safety Defense
25 papers
Training-free methods that protect VLMs at inference time, including activation steering, representation projection, suffix generation, and image-to-text conversion to restore LLM safety alignment.
Adversarial Robustness
25 papers
Methods for making vision encoders and VLMs robust to adversarial perturbations, including unsupervised adversarial fine-tuning, pre-trained model guided training, and large-scale adversarial pre-training.
Hallucination Detection and Mitigation
30 papers
Research on detecting and reducing hallucinations in multimodal models, including fine-grained human feedback, adversarial hallucination generation, sharpness-aware unlearning, and visual grounding techniques.
Safety Benchmarks and Evaluation
35 papers
Comprehensive benchmarks and evaluation frameworks for assessing VLM safety across dimensions including jailbreak resistance, hallucination rates, moral robustness, and reliability under visual corruptions.
Data Poisoning and Backdoor Attacks
20 papers
Attacks that inject malicious behaviors into VLMs through training data manipulation, including stealthy data poisoning, backdoor injection via model merging, and knowledge base poisoning in RAG systems.
Embodied and Agent Safety
25 papers
Safety evaluation and defense for VLM-powered embodied agents in autonomous driving, robotic manipulation, and household environments, addressing both adversarial attacks and natural failure modes.
Robustness to Modality Issues
22 papers
Research on maintaining model performance when modalities are missing, noisy, or conflicting, including missing modality adaptation, certifiable robustness, and cross-modal conflict resolution.
Key Insights
- Visual inputs fundamentally bypass text-only safety alignment in VLMs
- Safety fine-tuning with just 2,000 images can reduce attack success by 98%
- Structured safety reasoning chains outperform binary refusal approaches
- Multi-modal reasoning models have 37% higher jailbreak rates than base models
- Spurious textual correlations create a 'safety mirage' easily broken by one-word attacks
Timeline
Research has evolved from isolated attack demonstrations to comprehensive safety frameworks spanning the full model lifecycle. Early work focused on proving vulnerabilities exist; current work increasingly addresses real-world deployment challenges including embodied agent safety, privacy-aware geolocation, and consequence-driven alignment for autonomous systems.
- Multi-modal indirect prompt injection (Abusing Images and Sounds for..., 2023) demonstrated adversarial perturbations in images/audio can hijack VLM behavior
- Image Hijacks (2023) introduced Behaviour Matching achieving 100% success in controlling VLM outputs via optimized images
- MM-SafetyBench (2023) discovered typography-based attacks increase ASR by 30%+ over text-only baselines
- RLHF-V (2023) pioneered fine-grained segment-level corrections for hallucination, reducing error by 34.8% with 1.4k samples
- Robust Instruction Tuning (2023) created the first large-scale dataset with negative instructions to teach models to say 'No'
Discovery that multimodal inputs fundamentally bypass text-only safety alignment, establishing the visual modality as a critical attack surface.
- VLGuard (Safety Fine-Tuning at Almost No Cost, 2024) proved safety can be restored with just 2,000 curated images via mixed fine-tuning
- Robust CLIP (2024) achieved adversarial robustness at 0.2% of CLIP training cost using unsupervised feature consistency
- ECSO (Eyes Closed, Safety On, 2024) demonstrated +58.6% harmless rate improvement via training-free image-to-text conversion
- Shadowcast (2024) showed VLMs can be poisoned with as few as 50 stealthy samples
- BadMerging (2024) achieved >90% ASR against merged models where prior methods failed at <20%
- Red Teaming VLMs (Red Teaming Visual Language Models, 2024) established a comprehensive 4-aspect safety taxonomy (Faithfulness, Privacy, Safety, Fairness)
- Safe RLHF-V (2025) introduced decoupled dual-preference optimization with 7-point safety scale, achieving +34.2% safety improvement
- Double Visual Defense (2025) achieved ~70% robustness improvement through adversarial pre-training from scratch
- Safety at Scale (2025) unified safety research across six model types with 574 papers analyzed
- SafeMLRM (2025) quantified the 'Reasoning Tax' showing MLRMs have 37.44% higher jailbreak rates than base models
- PRISM (2025) introduced 4-step safety Chain-of-Thought with MCTS-generated preference pairs, achieving 0.15% ASR on JailbreakV-28K
- GuardReasoner-VL (2025) improved guard model F1 by +19.27% using online RL with safety-aware data concatenation
Shift from binary safety classification to structured reasoning-based safety, where models must explain their safety decisions through explicit perception-reasoning-decision chains.
- OOD-MMSafe (2026) shifted from intent detection to causal projection, reducing risk identification failure to 5.7%
- LabShield (2026) evaluated 33 MLLMs on lab safety with OSHA/GHS standards, finding 32% performance drop in professional scenarios
- SaFeR-ToolKit (2026) formalized safety as a checkable protocol with virtual tool traces
- ConflictBench (2026) showed alignment failures occur at step 5.28 on average in multi-turn interactions, proving single-turn benchmarks miss delayed misalignment
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Multi-Modal Jailbreak Attacks | Visual inputs create a continuous, high-dimensional attack surface that circumvents discrete text safety filters via gradient optimization, encrypted visual encodings, or prompt injection. | Multi-Modal Linkage (MML) achieves 99.40% Attack Success Rate on GPT-4o, improving over prior baselines by +66.4% on SafeBench | Jailbreak Large Vision-Language Models Through... (2024), Image Hijacks (2023), MM-SafetyBench (2023), Cross-modal Adversarial Multimodal Obfuscation (CAMO) (2025), JPS (2025) |
| Safety-Aligned Fine-Tuning | Combine safety-specific training data with modified optimization objectives that separate helpfulness from safety constraints, preventing the model from learning spurious refusal shortcuts. | VLGuard Mixed Fine-Tuning reduces Attack Success Rate from 53.6% to 1.1% on LLaVA-v1.5-7B while maintaining helpfulness | Safety Fine-Tuning at (Almost) No... (2024), Safe RLHF-V (2025), SaFeR-VLM (2025), SaFeR-ToolKit (2026) |
| Inference-Time Safety Defense | Exploit the insight that visual embeddings create detectable anomalies in the model's representation space, which can be identified and corrected during inference via activation steering or image-to-text conversion. | ASTRA reduces Attack Success Rate by 17.84% over JailGuard while running 9x faster by avoiding multiple inference passes | Eyes Closed, Safety On: Protecting... (2024), ASTRA (2024), Understanding and Defending VLM Jailbreaks... (2026), VLM-Guard (2025) |
| Adversarial Robustness Training | Force the vision encoder to produce identical representations for clean and adversarially perturbed images, either through feature consistency losses or full adversarial pre-training on web-scale data. | Double Visual Defense (Δ²-LLaVA) achieves ~70% absolute robustness improvement on Stanford Cars over prior methods (TeCoA, FARE) while maintaining clean performance | Robust CLIP (2024), Double Visual Defense (2025), Pre-trained Model Guided Fine-Tuning for... (2024), Anyattack (2025) |
| Hallucination Mitigation via Fine-Grained Feedback | Instead of ranking entire responses, collect precise corrections at the segment level and use modified optimization objectives that give higher weight to corrected regions, teaching models where exactly they hallucinate. | RLHF-V reduces hallucination rate by 34.8% using only 1.4k annotated samples, outperforming LLaVA-RLHF which required 10k samples | RLHF-V (2023), Beyond Superficial Unlearning (2026), GHOST (2025), Mitigating Hallucination in Large Multi-Modal... (2023) |
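
The feature consistency idea in the Adversarial Robustness Training row can be sketched as a loss that pulls the tuned encoder's embedding of a perturbed image toward a frozen copy's embedding of the clean image. This is a toy illustration in the spirit of unsupervised adversarial fine-tuning, not the implementation of any listed paper: the encoders are stand-in functions, the names are hypothetical, and the adversarial inputs are assumed to be given (in practice they would come from an attack such as PGD).

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[float]], Vector]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def feature_consistency_loss(
    encoder: Encoder,
    frozen_encoder: Encoder,
    clean_batch: Sequence[Sequence[float]],
    adversarial_batch: Sequence[Sequence[float]],
) -> float:
    """Mean squared distance between encoder(x_adv) and frozen_encoder(x_clean)."""
    if len(clean_batch) != len(adversarial_batch) or not clean_batch:
        raise ValueError("batches must be non-empty and aligned")
    distances = [
        squared_l2(encoder(x_adv), frozen_encoder(x))
        for x, x_adv in zip(clean_batch, adversarial_batch)
    ]
    return sum(distances) / len(distances)


def toy_encoder(x: Sequence[float]) -> Vector:
    """Stand-in for a vision encoder: doubles every input value."""
    return [2.0 * v for v in x]


clean = [[1.0, 2.0]]
perturbed = [[1.1, 1.9]]  # assumed output of some attack, not computed here
loss = feature_consistency_loss(toy_encoder, toy_encoder, clean, perturbed)
print(loss)
```

No labels appear anywhere in the loss, which is why this style of objective is described as unsupervised; minimizing it over the tuned encoder's weights is what would make the embeddings insensitive to the perturbation.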
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MM-SafetyBench | Attack Success Rate (lower is safer for defense, higher for attack) | 1.1% ASR (defense), reduced from 53.6% baseline | Safety Fine-Tuning at (Almost) No... (2024) |
| JailBreakV-28K | Attack Success Rate (lower is safer) | 0.15% ASR | PRISM (2025) |
| POPE (Polling-based Object Probing Evaluation) | Accuracy | 85.0% accuracy on Random split | Mitigating Hallucination in Large Multi-Modal... (2023) |
| SafeBench / HADES (Jailbreak Attack Evaluation) | Attack Success Rate (higher indicates more effective attack) | 99.40% ASR on GPT-4o | Jailbreak Large Vision-Language Models Through... (2024) |
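
Attack Success Rate (ASR) is simply the share of attack attempts that elicit a harmful response, and the headline "98%" figure for safety fine-tuning is a relative reduction in ASR. The short sketch below works through that arithmetic with the 53.6% and 1.1% values from the table; the function names are illustrative, and how each attempt is judged harmful (human or model judge) is outside its scope.

```python
def attack_success_rate(outcomes):
    """Percentage of attack attempts judged successful (True = harmful output obtained)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(1 for success in outcomes if success) / len(outcomes)


def relative_reduction(before, after):
    """Relative drop in ASR, in percent of the undefended baseline."""
    if before <= 0:
        raise ValueError("baseline ASR must be positive")
    return 100.0 * (before - after) / before


# Toy judgments for four attack attempts: one succeeded.
print(attack_success_rate([True, False, False, False]))

# Values reported above for mixed safety fine-tuning on LLaVA-v1.5-7B.
reduction = relative_reduction(53.6, 1.1)
print(round(reduction, 1))
```

The absolute drop is 52.5 percentage points, while the relative reduction rounds to 98%, so it matters which of the two a paper is reporting when comparing defenses.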
⚠️ Known Limitations (4)
- Safety-utility trade-off: Most defense methods reduce model helpfulness when improving safety, leading to over-refusal of benign queries that superficially resemble unsafe ones. (affects: Safety-Aligned Fine-Tuning, Inference-Time Safety Defense)
Potential fix: Machine unlearning (removing harmful knowledge) rather than supervised refusal, and structured reasoning chains that separate intent classification from response generation
- Arms race dynamics: Each new defense is quickly circumvented by more sophisticated attacks, and defenses designed for known attack patterns fail to generalize to novel threats. (affects: Multi-Modal Jailbreak Attacks, Inference-Time Safety Defense)
Potential fix: Proactive defense frameworks that reason about potential harm rather than pattern-matching known attacks, such as consequence-aware safety policies
- Evaluation gaps: Static benchmarks fail to capture temporal dynamics, multi-turn escalation, and interaction effects where individually safe components combine to create harm. (affects: Safety-Aligned Fine-Tuning, Hallucination Mitigation via Fine-Grained Feedback)
Potential fix: Interactive, process-oriented evaluation frameworks that test agents across multi-step scenarios with dynamic risk emergence, as proposed by IS-Bench and ConflictBench
- Scalability of robust training: Adversarial pre-training from scratch is highly effective but requires enormous computational resources, limiting accessibility for the research community. (affects: Adversarial Robustness Training)
Potential fix: Efficient alternatives like FARE that achieve robustness at 0.2% of training cost, or test-time compute scaling approaches like Self-Critical Inference
Major papers in this topic (10)
- Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety (2025-02) 9
- Jailbreak Large Vision-Language Models Through Multi-Modal Linkage (2024-11) 9
- Double Visual Defense: A Novel Adversarial Defense for Vision-Language Models (2025-02) 9
- OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences (2026-03) 9
- LabShield: A Multimodal Benchmark for Safety-Critical Reasoning in Scientific Laboratories (2026-03) 9
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models (2024-02) 8
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback (2023-12) 8
- Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings (FARE) (2024-02) 8
- IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks (2025-06) 9
- BadMerging: Backdoor Attacks Against Model Merging (2024-08) 9
Another cross-cutting theme examines Analysis.
Analysis
What: Research on evaluating, benchmarking, aligning, and understanding multimodal models, particularly Vision-Language Models (VLMs), across diverse tasks, domains, and safety dimensions.
Why: As multimodal models are deployed in high-stakes settings like healthcare, autonomous driving, and embodied AI, rigorous analysis of their capabilities, failure modes, and alignment is essential for trustworthy deployment.
Baseline: Standard VLM evaluation relies on static, single-turn benchmarks with multiple-choice questions, aggregate accuracy metrics, and general-purpose prompting without domain adaptation.
- Benchmarks allow shortcut learning via text priors and guessing, inflating true capability estimates
- Models hallucinate confidently, failing to ground reasoning in visual evidence as sequences grow longer
- Safety alignment degrades when visual modality is added, enabling jailbreak attacks that bypass text-only defenses
🧪 Running Example
Baseline: A standard VLM might answer 'three pedestrians, safe to proceed' by relying on text priors about typical intersections, without actually counting individuals or detecting a partially occluded child in the crosswalk.
Challenge: This example illustrates multiple challenges: the model must count accurately (a known VLM weakness), reason spatially about occlusion, ground its answer in the actual image rather than language priors, and make a safety-critical judgment where hallucination could be catastrophic.
Overall Progress
The field has evolved from basic capability cataloging to sophisticated diagnostic evaluation and mechanistic understanding. Early work (2023) established foundational benchmarks, but 2024 brought a paradigm shift toward live human-preference arenas and process-aware evaluation. The 2025 RL revolution introduced visually-grounded training methods that directly address the core problem of VLMs ignoring visual evidence. By 2026, research has converged on internal representation analysis for both improving capabilities and defending against attacks, while frontier benchmarks continue to expose fundamental gaps in spatial reasoning, visual tracking, and safety-critical deployment.
Sub-topics
VLM Benchmarking & Evaluation
220 papers
Creating rigorous, diverse benchmarks to evaluate VLM capabilities across reasoning, perception, spatial understanding, cultural knowledge, and domain-specific tasks, while addressing shortcomings like shortcut learning and inflated scores.
Multimodal Reinforcement Learning & Alignment
120 papers
Applying reinforcement learning techniques, particularly GRPO and its variants, to improve VLM reasoning, reward modeling, and human preference alignment, including novel reward model architectures.
Hallucination Detection & Factuality
80 papers
Identifying, measuring, and mitigating hallucinations in multimodal models (where generated text contradicts visual evidence or world knowledge) through mechanistic analysis, spectral filtering, and multi-agent verification.
Safety, Robustness & Adversarial Analysis
70 papers
Evaluating and defending VLMs against jailbreak attacks, adversarial inputs, and safety failures across text and visual modalities, including red-teaming frameworks and defense mechanisms.
Domain-Specific Multimodal Analysis
167 papers
Adapting and evaluating multimodal models for specialized domains including medical imaging, autonomous driving, agriculture, remote sensing, and scientific applications.
Key Insights
- Visual reasoning does not scale with model size; specialized objectives matter more than parameters.
- Live human-preference arenas achieve >0.94 correlation with human judgment, far surpassing static benchmarks.
- Reinforcement learning with visual-token awareness yields 10-19% gains over standard GRPO on multimodal tasks.
- VLMs frequently hallucinate because language priors overwhelm visual evidence as sequence length grows.
- Safety alignment degrades dramatically when visual modality is added, enabling cross-modal jailbreaks.
Timeline
Research has progressed from evaluating 'what VLMs can do' to understanding 'why they fail' through mechanistic interpretability, diagnostic benchmarks, and representation-level analysis, while simultaneously developing RL-based training methods that ground reasoning in visual evidence.
- (MM-SafetyBench, 2023) discovered typography-based visual jailbreaking, establishing the first multimodal safety evaluation framework.
- (MM-Vet, 2023) introduced capability integration evaluation, defining 6 core VL capabilities and their 16 combinations.
- (LAMM, 2023) created the first open-source instruction tuning dataset including 3D point clouds alongside images.
- (MRECG, 2023) and (PCR, 2023) enabled efficient deployment of diffusion and vision models.
- (WildVision, 2024) launched the first live VLM arena with Elo ratings achieving 0.94 correlation with human preferences.
- (VisionArena, 2024) scaled to 230K real-world conversations across 45 VLMs and 138 languages.
- (UniBench, 2024) consolidated 53 benchmarks revealing that reasoning capabilities do not scale linearly like recognition.
- Spider2-V (Spider2-V, 2024) introduced full-stack data science agent evaluation where GPT-4V achieved only 14% success.
- InternVL3 (InternVL3, 2025) pioneered native multimodal pre-training, achieving 72.2 on MMMU and setting a new SOTA for open-source MLLMs.
- (Merlin, 2024) introduced 3D-native CT vision-language pretraining, outperforming 2D baselines by +32.1% F1 on findings classification.
Shift from static benchmarks to live human-preference arenas (WildVision, VisionArena) and from answer-only evaluation to process-aware assessment.
- (VPPO, 2025) introduced token-level visual perception masking for RL, achieving +19.2% improvement across eight benchmarks.
- (VisuLogic, 2025) showed SOTA models achieve near-random performance on visual logic problems that resist language shortcuts.
- SophiaVL-R1 (SophiaVL-R1, 2025) introduced thinking reward models that score the reasoning process as a whole rather than individual steps.
- (CoreCognition, 2024) exposed that models fail at rudimentary developmental tasks while excelling at complex reasoning.
- (MM-MATH, 2024) introduced process evaluation via LMM-as-a-judge, revealing diagram misinterpretation accounts for >50% of errors.
Emergence of visually-grounded RL methods (VPPO, AT-RL) that focus training on tokens with high visual dependency, alongside diagnostic benchmarks exposing fundamental VLM limitations.
- (Latent CoT, 2026) internalized chain-of-thought reasoning into efficient discriminative reward models, surpassing GPT-5 by +9.6%.
- (AT-RL, 2026) used graph-based anchor token identification for precise credit assignment in multimodal RL.
- (JRS-Rem, 2026) proposed representation-space defense reducing jailbreak success from 84% to 15%.
- (DatBench, 2026) achieved 13x evaluation speedup while improving discriminative power through data-centric curation.
- (LabShield, 2026) revealed a 32% performance drop when moving from text MCQs to embodied laboratory safety scenarios.
- (VET-Bench, 2026) proved that visual entity tracking is NC1-complete and showed that Molmo2-SGCoT achieved >90% accuracy where frontier models scored near random.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Visually-Grounded Reinforcement Learning for VLMs | Measure each token's visual dependency via attention or KL divergence, then weight RL policy updates to prioritize visually-grounded reasoning paths. | Improves on standard GRPO by +19.2% average accuracy on Qwen2.5-VL-7B across eight multimodal benchmarks (VPPO), and +8.24 points on five math benchmarks (AT-RL). | Spotlight on Token Perception for... (2025), Credit Where It's Due: Cross-Modality... (2026), Advancing Multimodal Reasoning via Reinforcement... (2025) |
| Generative & Latent Chain-of-Thought Reward Models | Train reward models to generate explanations alongside scores, then discard the generation head at inference to retain internalized reasoning in an efficient discriminative scorer. | Latent CoT achieves 85.1% accuracy on EditReward-Bench, surpassing GPT-5 (75.5%) by +9.6 points; EditScore-72B surpasses GPT-4o (84.41%) with 86.36% accuracy. | Joint Reward Modeling (2026), EditScore (2025), Skywork-VL Reward (2025) |
| Human-Preference Arena Evaluation | Deploy anonymous VLM battles in live platforms where users vote on preferred responses, converting pairwise wins into statistically robust Elo rankings. | WildVision-Bench achieves 0.94 Spearman correlation with human Elo ratings; VisionArena-Bench achieves 0.973 Spearman correlation, surpassing WildVision-Bench (0.802). | WildVision (2024), VisionArena (2024), CapArena (2025) |
| Hallucination Suppression via Internal Representation Analysis | Identify hallucination-prone directions in the model's representation space via eigendecomposition or probing, then dampen those specific modes in network weights or activations. | SRF achieves SOTA faithfulness on POPE and MSCOCO with zero inference latency overhead; VIB-Probe improves M-HalDetect AUROC by +2.84% over baselines. | Suppressing VLM Hallucinations with Spectral... (2025), VIB-Probe (2026), Multimodal large language models excel... (2024) |
| Multi-Modal Safety Jailbreak & Defense | Harmful content encoded in images (via typography or metaphorical encryption) bypasses text-aligned safety filters; defenses identify and subtract the jailbreak-specific activation shift in representation space. | Multi-Modal Linkage achieves 99.40% attack success rate on GPT-4o, improving over baselines by +66.4%; VLGuard Mixed Fine-Tuning reduces attack success from 53.6% to 1.1%. | MM-SafetyBench (2023), Jailbreak Large Vision-Language Models Through... (2024), Safety Fine-Tuning at (Almost) No... (2024), Understanding and Defending VLM Jailbreaks... (2026) |
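The visual-dependency weighting in the first table row can be illustrated with a small sketch. It assumes we already have, for each generated token, the model's next-token distribution with the real image and with the image masked out; the function names, the `keep_fraction` rule, and the toy numbers are hypothetical and are not taken from VPPO or AT-RL.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def visual_dependency(probs_with_image, probs_without_image):
    """Per-token score: how much the prediction changes when the image is removed."""
    return [kl_divergence(p, q) for p, q in zip(probs_with_image, probs_without_image)]

def token_weights(dependency, keep_fraction=0.5):
    """Binary mask keeping the most visually dependent tokens (illustrative rule)."""
    k = max(1, math.ceil(len(dependency) * keep_fraction))
    threshold = sorted(dependency, reverse=True)[k - 1]
    return [1.0 if d >= threshold else 0.0 for d in dependency]

def weight_advantages(sequence_advantage, weights):
    """Spread one sequence-level advantage over tokens, zeroing masked tokens."""
    return [sequence_advantage * w for w in weights]

# Toy example: three response tokens, vocabulary of size two (made-up numbers).
with_image = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
without_image = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.75]]
dependency = visual_dependency(with_image, without_image)
weights = token_weights(dependency, keep_fraction=0.5)
token_advantages = weight_advantages(2.0, weights)
```

In this toy case the second token's distribution is unchanged by masking the image, so it receives zero weight and contributes nothing to the policy update.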
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MMMU (Massive Multi-discipline Multimodal Understanding) | Accuracy | 72.2% | InternVL3 (2025) |
| MathVista (Mathematical Visual Reasoning) | Accuracy | 73.4% | Advancing Multimodal Reasoning via Reinforcement... (2025) |
| EditReward-Bench | Accuracy | 86.36% | EditScore (2025) |
| VisionArena-Bench (Human Preference Correlation) | Spearman Correlation | 0.973 | VisionArena (2024) |
| MM-SafetyBench (Attack Success Rate Reduction) | Attack Success Rate (lower is better) | 1.1% ASR (down from 53.6%) | Safety Fine-Tuning at (Almost) No... (2024) |
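The Spearman correlation metric reported for VisionArena-Bench measures how well a benchmark's model ranking agrees with a reference ranking such as arena Elo. The sketch below computes it from scratch as the Pearson correlation of average ranks; the scores and Elo values are invented for illustration and do not come from any cited paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical benchmark accuracies and arena Elo ratings for four models.
benchmark_scores = [71.0, 64.5, 58.2, 49.9]
arena_elo = [1210, 1180, 1190, 1100]
rho = spearman(benchmark_scores, arena_elo)
```

Here the benchmark swaps the order of the two middle models relative to the arena, so the correlation is high but below 1.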
Known Limitations (4)
- Benchmark saturation and shortcut learning: Many benchmarks allow models to score well using text priors or guessing strategies without genuine visual understanding, inflating capability estimates. (affects: Human-Preference Arena Evaluation, VLM Benchmarking & Evaluation)
Potential fix: Convert multiple-choice to generative evaluation, filter questions solvable without visual input, and use controlled stimuli (CIVET, VisuLogic) that resist language shortcuts.
- Reasoning-hallucination trade-off: Longer reasoning chains improve logical inference but degrade visual grounding, causing models to 'forget' the image as they reason more. (affects: Visually-Grounded Reinforcement Learning for VLMs, Hallucination Suppression via Internal Representation Analysis)
Potential fix: Use visual anchoring during reasoning (VAPO), moderate reasoning length via difficulty-aware budgets, or apply spectral filtering to maintain visual grounding.
- Cross-modal safety gap: Adding a vision modality weakens the safety alignment of the underlying LLM, and current defenses remain fragile against sophisticated multi-modal attacks. (affects: Multi-Modal Safety Jailbreak & Defense)
Potential fix: Disentangle safety-relevant activation shifts from semantic shifts (ShiftDC), identify and project out jailbreak representation directions (JRS-Rem), or include safety data during visual instruction tuning.
- Domain transfer gap: Models achieving strong general-purpose performance often fail catastrophically on specialized domains (medical, scientific, agricultural) where fine-grained visual details and domain knowledge are critical. (affects: Domain-Specific Multimodal Analysis)
Potential fix: Use domain-specific pretraining with expert data (KeepFIT, Merlin), construct specialized instruction-tuning datasets, or develop hybrid systems combining VLMs with domain-specific tools.
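The idea of projecting out a jailbreak direction in representation space, mentioned among the fixes above, can be sketched generically. The example estimates a direction as the difference of mean activations between jailbreak and benign prompts and removes that component from a hidden state. This is an assumed difference-of-means recipe for illustration only, not the published JRS-Rem or ShiftDC procedure, and the activations are made-up two-dimensional vectors.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit_direction(jailbreak_acts, benign_acts):
    """Unit vector pointing from the benign mean to the jailbreak mean."""
    diff = [a - b for a, b in zip(mean_vector(jailbreak_acts), mean_vector(benign_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]

def project_out(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Hypothetical 2-D activations collected at one layer.
benign = [[1.0, 0.0], [1.0, 2.0]]
jailbreak = [[4.0, 1.0], [4.0, 1.0]]
direction = unit_direction(jailbreak, benign)
cleaned = project_out([5.0, 2.0], direction)
```

A full projection like this also removes any benign signal that happens to lie along the same direction, which is one reason the text above stresses disentangling safety-relevant shifts from semantic ones.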
Major papers in this topic (10)
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (2025-04) 9
- Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models (2026-02) 9
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (2024-06) 9
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (2024-07) 9
- Spotlight on Token Perception for Multimodal Reinforcement Learning (2025-10) 8
- Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset (2024-06) 9
- Jailbreak Large Vision-Language Models Through Multi-Modal Linkage (2024-11) 9
- Can Vision-Language Models Solve the Shell Game? (2026-03) 9
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models (2025-04) 8
- Kimi K2.5: Visual Agentic Intelligence (2026-02) 9
Another cross-cutting theme examines Benchmark.
Benchmark
What: Research on constructing evaluation benchmarks, datasets, and metrics for systematically assessing multimodal models across vision-language understanding, reasoning, safety, and domain-specific tasks.
Why: Without rigorous, standardized evaluation, inflated benchmark scores and hidden model failures impede trustworthy deployment of multimodal AI in real-world applications.
Baseline: Early evaluation relied on narrow Visual Question Answering (VQA) datasets like VQAv2 with exact-match metrics, testing single-image perception in controlled settings.
- Existing benchmarks suffer from data contamination, language shortcuts, and multiple-choice guessing that inflate model scores
- Models pass high-level reasoning benchmarks yet fail rudimentary perception tasks like counting and spatial reasoning
- Evaluating open-ended, multi-modal, multi-turn interactions with subjective human preferences remains unsolved
Running Example
Baseline: A standard VLM benchmark would test this with a single multiple-choice question on a clean image, allowing the model to guess from textual cues without truly perceiving the scene.
Challenge: This example requires multi-modal reasoning (satellite imagery + geospatial context), fine-grained perception (counting small objects), domain expertise (disaster response), and compositional reasoning (route planning), capabilities that no single existing benchmark adequately tests.
Overall Progress
The benchmarking landscape has undergone a paradigm shift from narrow, single-capability VQA datasets to comprehensive, multi-dimensional evaluation ecosystems. Early work focused on establishing foundational benchmarks with robust anti-cheating mechanisms (CircularEval, blind baselines). The field then scaled to video, long-context, and domain-specific professional evaluation while incorporating live human preference signals. Most recently, research has turned to probing fundamental cognitive limitations, interactive safety in embodied settings, and culturally diverse evaluation, revealing that even frontier models fail at basic perception tasks that are trivial for humans.
Sub-topics
General VLM Capability Benchmarks
95 papers
Comprehensive benchmarks evaluating core vision-language capabilities including perception, reasoning, OCR, spatial understanding, and knowledge integration across diverse tasks and formats.
Spatial, 3D, and Embodied Understanding Benchmarks
80 papers
Benchmarks targeting spatial perception, 3D scene understanding, embodied navigation, and physical reasoning, capabilities where VLMs consistently underperform humans despite strong general reasoning.
Safety, Robustness, and Hallucination Benchmarks
75 papers
Evaluations of model reliability under adversarial attacks, hallucination detection, safety alignment, privacy risks, and robustness to visual corruption or textual misinformation.
Domain-Specific Benchmarks and Datasets
120 papers
Specialized benchmarks for medicine, remote sensing, autonomous driving, agriculture, telecom, and other professional domains where general-purpose VLMs fail due to domain gaps.
Large-Scale Dataset Construction and Instruction Tuning
80 papers
Methods for constructing high-quality multimodal training datasets at scale, including synthetic data generation, instruction tuning data curation, and data quality filtering pipelines.
Reasoning, Mathematical, and Cognitive Benchmarks
60 papers
Benchmarks probing abstract reasoning, mathematical problem-solving, chain-of-thought quality, and core cognitive abilities in multimodal models, revealing fundamental gaps between model and human intelligence.
Key Insights
- VLMs exploit textual shortcuts: many score higher without visual input than with it.
- 56% of VLM reasoning failures trace to perception deficits, not logic errors.
- Low-level cognitive abilities show zero improvement with model scaling.
- Live arena evaluation outperforms static benchmarks in predicting human preference.
- Domain-specific benchmarks reveal 30-50% capability gaps versus general evaluation.
Timeline
The evolution follows a clear trajectory: from testing 'what models can do' (capability benchmarks) to testing 'what models cannot do' (cognitive profiling) to testing 'what models should not do' (safety and privacy evaluation), with increasing emphasis on ecological validity through real-world data and interactive environments.
- (Visual Instruction Tuning, 2023) pioneered GPT-assisted visual instruction data generation and the visual instruction tuning paradigm.
- (MMBench, 2023) introduced CircularEval and LLM-based answer extraction, establishing the gold standard for robust VLM evaluation.
- (MM-Vet, 2023) defined capability integration evaluation across 6 core VL skills with open-ended LLM scoring.
- (MVBench, 2023) systematically extended static image tasks to 20 dynamic video tasks, filling the temporal understanding evaluation gap.
- ShareGPT4V (ShareGPT4V, 2023) demonstrated that high-quality detailed captions scale VLM performance, gaining +36.1 points on MME.
- (MIMIC-IT, 2023) introduced multi-modal in-context instruction tuning with 2.8M samples across images and videos.
- (MM-SafetyBench, 2023) discovered typography-based visual jailbreaking, exposing critical safety vulnerabilities in VLMs.
Shift from narrow VQA evaluation to comprehensive multi-capability benchmarking with LLM-as-judge evaluation, establishing the modern VLM evaluation paradigm.
- (Video-MME, 2024) created the first full-spectrum video benchmark spanning short to long durations with subtitle/audio integration.
- (WildVision, 2024) launched the first Chatbot Arena for VLMs, achieving 0.94 Spearman correlation with human preferences.
- (MathVerse, 2024) exposed that VLMs score higher on text-only versions of visual math problems, proving reliance on textual shortcuts.
- Spider2-V (Spider2-V, 2024) tested full-stack data science workflows in live VMs, where GPT-4V achieved only 14% success rate.
- (Merlin, 2024) established 3D-native medical VLM evaluation published in Nature, achieving +16% F1 zero-shot over supervised baselines.
- (CoreCognition, 2024) revealed VLMs show a reversed capability curve, failing low-level tasks that improve with human development.
- (VisionArena, 2024) scaled arena evaluation to 230K real conversations, achieving 97.3% correlation with live leaderboards.
- Pixtral-12B (Pixtral 12B, 2024) introduced RoPE-2D for native variable-resolution processing, outperforming 7x larger models on MMMU.
Transition from single-image benchmarks to multi-modal, multi-turn, real-world evaluation including live human preference arenas and domain-specific professional tasks.
- (IS-Bench, 2025) introduced process-oriented interactive safety evaluation, showing all SOTA agents achieve <40% safe success rate.
- (SPINBENCH, 2025) demonstrated strong egocentric bias in spatial reasoning β models fail allocentric perspective taking entirely.
- (AHELM, 2025) created the first holistic audio-language model benchmark spanning diverse audio understanding tasks.
- (EditScore, 2025) achieved 86.36% accuracy on image editing reward evaluation, surpassing GPT-4o and GPT-5.
- (DatBench, 2026) achieved 13x evaluation speedup through data-centric curation, correcting inflated MCQ scores by ~35 points.
- (VRIQ, 2026) attributed 56% of VLM failures to perception-only deficits via parallel domain diagnostic benchmarking.
- (VLM-GeoPRIVACY, 2026) revealed that GPT-5 over-discloses sensitive location data 47.6% of the time.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Arena-Based Human Preference Evaluation | Users chat with two anonymous models simultaneously and cast preference votes, producing statistically robust Elo rankings from thousands of real-world interactions. | Improves on static benchmarks like MMBench by achieving 0.94-0.97 Spearman correlation with live human Elo ratings, versus 0.80 for prior automated evaluations. | WildVision (2024), VisionArena (2024), CapArena (2025) |
| Robust Anti-Shortcut Benchmark Design | Circular evaluation shuffles answer choices across multiple passes, while blind baselines verify models cannot solve questions without visual input, ensuring genuine multimodal understanding. | DatBench achieves 13x evaluation speedup and reveals ~35 point accuracy drop on AI2D when converting MCQ to generative format, correcting inflated capability estimates from prior benchmarks. | MMBench (2023), MathVerse (2024), DatBench (2026) |
| Hierarchical Cognitive Capability Profiling | Adapts established human cognitive frameworks (like Piaget's developmental stages or Gardner's Multiple Intelligences) to create hierarchical VLM diagnostics that isolate perception, attention, and reasoning failures. | Reveals that 56% of VLM failures stem from perception deficits, not reasoning, and that core cognitive abilities show no improvement with model scaling, unlike prior holistic benchmarks that masked these patterns. | Core Knowledge Deficits in Multi-Modal... (2024), Defining and Evaluating Visual Language... (2025), VRIQ (2026) |
| Domain-Adaptive Multi-Task Benchmark Suites | Integrates domain expert knowledge into benchmark construction with multi-granularity tasks spanning basic recognition to complex reasoning, using professional-grade data sources unavailable in web-scraped training sets. | Achieves +16% F1 zero-shot on findings classification over supervised training (Merlin on CT scans), while AgroBench reveals open-source models score only 30% on weed identification versus GPT-4o's 79%. | Merlin (2024), Spider2-V (2024), AgroBench (2025) |
| Process-Oriented Interactive Safety Evaluation | Triggers safety checks immediately before or after risk-prone actions in interactive environments, detecting intermediate unsafe behaviors that termination-based evaluation misses entirely. | Reveals that GPT-4o achieves <40% Safe Success Rate on IS-Bench and over-discloses location 47.6% of the time on VLM-GeoPRIVACY, failures invisible to prior binary Attack Success Rate metrics. | MM-SafetyBench (2023), IS-Bench (2025), VLM-GeoPRIVACY (2026) |
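The conversion of pairwise preference votes into Elo ratings, described in the arena row above, can be sketched with the classic online Elo update. The K-factor of 32, the initial rating of 1000, and the model names are illustrative assumptions rather than values from the cited papers; arena platforms may instead fit a Bradley-Terry model over all votes at once.

```python
from collections import defaultdict

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one vote; outcome is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

def rate_from_votes(votes, initial=1000.0, k=32.0):
    """Process (model_a, model_b, outcome) votes in order and return ratings."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)

# Hypothetical votes between two anonymous models.
one_vote = rate_from_votes([("model_x", "model_y", 1.0)])
many_votes = rate_from_votes([
    ("model_x", "model_y", 1.0),
    ("model_y", "model_x", 0.5),
    ("model_x", "model_y", 1.0),
])
```

Online Elo depends on the order in which votes arrive, which is why leaderboards built on few votes can be unstable.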
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MMBench | Accuracy with CircularEval (all shuffled passes must be correct) | Top-tier models achieve ~85% accuracy | MMBench (2023) |
| Video-MME | Accuracy (multiple-choice) | 81.3% with subtitles (Gemini 1.5 Pro) | Video-MME (2024) |
| MathVerse | Accuracy across Vision-Only to Text-Dominant problem versions | GPT-4V demonstrates the best visual comprehension but still drops significantly without text cues | MathVerse (2024) |
| VisionArena-Bench | Spearman correlation with live Arena Elo ratings | 97.3% Spearman correlation with live leaderboard | VisionArena (2024) |
| Spider2-V | Task Success Rate | 14.0% success rate (GPT-4V) | Spider2-V (2024) |
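The CircularEval rule noted in the MMBench row (a question counts as correct only if every shuffled pass is answered correctly) can be sketched as follows. The rotation scheme and the two toy predictors are assumptions for illustration; `predict` stands in for any model call that returns the index of the chosen option.

```python
def circular_variants(choices, answer_index):
    """Yield every circular rotation of the options with the shifted answer index."""
    n = len(choices)
    variants = []
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # The option originally at answer_index moves to (answer_index - shift) mod n.
        variants.append((rotated, (answer_index - shift) % n))
    return variants

def circular_eval(question, choices, answer_index, predict):
    """Return True only if the model is correct on all rotations."""
    return all(
        predict(question, rotated) == idx
        for rotated, idx in circular_variants(choices, answer_index)
    )

# Hypothetical predictors: one always picks the first option, one knows the answer.
def always_first(question, options):
    return 0

def knows_answer(question, options):
    return options.index("cat")

options = ["dog", "cat", "bird", "fish"]
guesser_passes = circular_eval("Which animal is shown?", options, 1, always_first)
grounded_passes = circular_eval("Which animal is shown?", options, 1, knows_answer)
```

A position-biased guesser is right on exactly one rotation out of four, so it fails the all-passes criterion that a single-pass accuracy metric would sometimes reward.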
Known Limitations (4)
- Data contamination and benchmark saturation: models may have seen test data during pretraining, inflating scores without genuine capability improvement. (affects: Robust Anti-Shortcut Benchmark Design, Arena-Based Human Preference Evaluation)
Potential fix: Multi-modal semantic perturbations can detect contamination without training data access; temporal separation and continuous benchmark renewal reduce leakage risk.
- Evaluation cost and scalability: comprehensive benchmarks require expensive LLM-based judging, human preference collection, or interactive simulation environments that are slow and expensive to run. (affects: Arena-Based Human Preference Evaluation, Process-Oriented Interactive Safety Evaluation)
Potential fix: Data-centric subset selection achieves 13x speedup; automated judges with high human correlation reduce reliance on manual evaluation.
- Cultural and linguistic bias: the vast majority of benchmarks are English-centric and Western-focused, systematically underestimating model failures on non-Western content. (affects: Domain-Adaptive Multi-Task Benchmark Suites, Hierarchical Cognitive Capability Profiling)
Potential fix: Culturally-sourced benchmarks with native annotators and multilingual parallel corpora enable fair cross-cultural evaluation.
- Gap between benchmark performance and deployment readiness: models can pass academic evaluations while failing catastrophically in interactive, safety-critical, or time-pressured real-world scenarios. (affects: Robust Anti-Shortcut Benchmark Design, Process-Oriented Interactive Safety Evaluation)
Potential fix: Process-oriented interactive evaluation with dynamic risk generation in simulators; reliability-focused benchmarks with corruption and text-only baselines to detect blind reasoning.
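The text-only (blind) baseline mentioned in the last fix can be sketched as a filter that flags benchmark items a model can answer without seeing the image. The item schema, the `max_blind_rate` threshold, and the `prior_guesser` stand-in for a text-only model are all hypothetical.

```python
def blind_solve_rate(item, text_only_predict, trials=3):
    """Fraction of trials in which a text-only model answers the item correctly."""
    hits = sum(
        text_only_predict(item["question"], item["choices"], seed) == item["answer"]
        for seed in range(trials)
    )
    return hits / trials

def split_by_visual_dependence(items, text_only_predict, max_blind_rate=0.5, trials=3):
    """Keep items the blind model rarely solves; drop items it solves too often."""
    kept, dropped = [], []
    for item in items:
        rate = blind_solve_rate(item, text_only_predict, trials)
        (dropped if rate > max_blind_rate else kept).append(item)
    return kept, dropped

def prior_guesser(question, choices, seed):
    """Stand-in for a text-only model: uses a language prior, else cycles options."""
    if "banana" in question and "yellow" in choices:
        return "yellow"
    return choices[seed % len(choices)]

items = [
    {"id": "banana", "question": "What color is the banana in the image?",
     "choices": ["yellow", "blue"], "answer": "yellow"},
    {"id": "count", "question": "How many people are in the crosswalk?",
     "choices": ["2", "3", "4"], "answer": "3"},
]
kept, dropped = split_by_visual_dependence(items, prior_guesser)
```

The counting item is solved blind only at chance level (one of three trials), so it is kept, while the banana item is answerable from priors alone and is dropped.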
Major papers in this topic (10)
- Visual Instruction Tuning (2023-04) 9
- MMBench: Is Your Multi-modal Model an All-around Player? (2023-07) 8
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (2024-06) 9
- VisionArena: 230K Real World User-VLM Conversations with Preference Labels (2024-12) 9
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2024-03) 9
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (2024-07) 9
- DatBench: Discriminative, Faithful, and Efficient VLM Evaluations (2026-01) 9
- IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks (2025-06) 9
- SPINBENCH: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs (2025-09) 9
Another cross-cutting theme examines Application.
Application
What: Research on deploying multimodal AI models that combine vision, language, and action to solve real-world tasks across robotics, healthcare, driving, and specialized domains.
Why: Bridging the gap between general-purpose multimodal models and the domain-specific reliability, efficiency, and grounding required for practical deployment.
Baseline: General-purpose vision-language models applied zero-shot or with minimal adaptation to domain-specific tasks, often producing hallucinations and lacking actionable outputs.
- Domain gap: general VLMs lack specialized knowledge for fields like medicine, agriculture, and telecommunications
- Deployment efficiency: large models are too slow and resource-heavy for real-time edge applications like robotics and driving
- Evaluation realism: existing benchmarks use clean, curated data that masks failures on noisy, multilingual, real-world inputs
Running Example
Baseline: A general-purpose VLM like GPT-4V would describe the image at a high level ('a chest X-ray showing lungs') but miss subtle pathological cues like microaneurysms or small nodules, hallucinate non-existent findings, and produce vague reports lacking clinical terminology, failing the radiologist's need for precise, actionable diagnostic support.
Challenge: This example illustrates three key challenges: (1) the perception gap, where general visual encoders miss fine-grained lesions; (2) the reasoning gap, where language priors override weak visual signals, causing hallucinations; and (3) the deployment gap, where cloud-based models introduce latency, cost, and privacy concerns incompatible with clinical workflows.
Overall Progress
The field evolved from exploratory demonstrations of multimodal capabilities (GPT-4V, 2023) through rigorous domain-specific adaptation and evaluation (2024-2025) to production-ready agentic systems and lightweight models matching frontier performance (2025-2026). A key paradigm shift occurred when reinforcement learning, particularly GRPO with rule-based rewards, was applied to VLMs, enabling dramatic reasoning improvements without learned reward models. Simultaneously, real-world benchmarks consistently exposed that even the best models achieve only 50-60% accuracy under realistic conditions, driving development of specialized, efficient deployment solutions.
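To make the phrase "GRPO with rule-based rewards" concrete, the sketch below scores a group of sampled responses with a simple rule (answer-tag format plus exact match) and normalizes the rewards within the group to obtain advantages, which is why no learned reward model is needed. The `<answer>` tag convention, the 0.5 format bonus, and the use of population standard deviation are assumptions for illustration; published implementations differ in these details.

```python
import re
import statistics

def rule_based_reward(response, gold_answer, format_bonus=0.5):
    """Reward = format bonus for a well-formed answer tag + 1.0 for an exact match."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    correct = match.group(1).strip() == gold_answer
    return format_bonus + (1.0 if correct else 0.0)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four responses sampled for one counting question.
responses = [
    "<answer>3</answer>",
    "<answer>4</answer>",
    "three",
    "<answer> 3 </answer>",
]
rewards = [rule_based_reward(r, gold_answer="3") for r in responses]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and are reinforced, while the malformed response is pushed down, all from a verifiable rule rather than a trained scorer.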
Sub-topics
Robotics & Embodied AI
14 papers
Vision-Language-Action (VLA) models combined with reinforcement learning for robotic manipulation, autonomous drone flight, and deployment-time reliability in unstructured real-world environments.
Healthcare & Biomedical AI
15 papers
Adapting multimodal models for clinical applications including radiology report generation, dermatology diagnosis, ophthalmology, and lightweight medical agentic systems that comply with privacy constraints.
Autonomous Driving & Transportation
12 papers
VLM-based perception, reasoning, and planning for autonomous vehicles, including fine-grained evaluation benchmarks, long-tail data curation, and visual chain-of-thought for driving theory.
Domain-Specific VLM Adaptation
35 papers
Methods for adapting general-purpose VLMs to specialized domains including agriculture, materials science, telecommunications, chart understanding, scientific visualization, and wildlife conservation.
Benchmarks & Real-World Evaluation
25 papers
New evaluation paradigms testing VLMs on high-resolution, noisy, multilingual, and domain-specific real-world scenarios that consistently expose 40-50% accuracy gaps between current models and human performance.
Efficient Models & Deployment
18 papers
Techniques for making multimodal models practical, including quantization, compact VLMs, parameter-efficient fine-tuning, and unified training infrastructure, to enable edge deployment and reduce costs.
Agentic Systems & Visual Reasoning
25 papers
Multi-step agent pipelines, tool-use benchmarks, RL-enhanced visual reasoning, and agentic frameworks that orchestrate multiple models to solve complex real-world tasks with self-correction and reflection.
Multimodal Infrastructure & Signal Processing
18 papers
Hardware designs for mm-wave 5G communications, multimodal sensor fusion, image denoising, joint source-channel coding, and foundational infrastructure enabling multimodal AI deployment at scale.
Key Insights
- Real-world benchmarks expose 50-60% accuracy ceiling for even the best multimodal models
- Rule-based RL (GRPO) enables small VLMs to surpass models 10-30x their size
- Domain-adaptive post-training with open-source data outperforms GPT-4-based methods
- Agentic multi-step pipelines match frontier model performance at 20-90x lower cost
- Visual noise and non-English languages cause 35%+ performance degradation in VLMs
Timeline
Research has shifted from 'can VLMs do X?' to 'how reliably, efficiently, and safely can VLMs do X in the real world?', driving three converging trends: domain-specific adaptation with open-source data, real-world evaluation rigor, and agentic multi-step orchestration with lightweight on-premise models.
- Comprehensive GPT-4V exploration (The Dawn of LMMs, 2023) systematically documented LMM capabilities across domains including medical imaging, celebrity recognition, and abstract visual reasoning
- CLIP2 (Contrastive Language-Image-Point Pretraining, 2023) bridged 2D vision-language models to 3D point cloud understanding with +253% improvement on outdoor recognition
- SQNR-based mixed precision quantization (Practical Mixed Precision Algorithm, 2023) recovered BERT accuracy from 74.13% to 82.97% via label-free layer sensitivity analysis
- (Efficient Multi-Modal Assistant, 2024) proved 2.7B-parameter models could compete with 7B+ VLMs, opening the path to edge deployment
- STAR benchmark (Situated Reasoning in Real-World Videos, 2024) exposed a ~50% gap between machine and human situated reasoning ability
GPT-4V demonstrated that large multimodal models could perform human-level reasoning across diverse visual domains, catalyzing an explosion of application-oriented research.
- MME-RealWorld (Benchmark for MLLM in the..., 2025) revealed that even GPT-4o fails to surpass 60% accuracy on real-world high-resolution tasks with human-annotated ground truth
- NVILA (Efficient Visual Language Models, 2024) introduced scale-then-compress architecture achieving +30% accuracy while cutting training costs by up to 5.1x
- Dream to Fly (Model-Based, 2025) achieved the first end-to-end pixel-to-command autonomous drone flight via learned world models with 100% simulation success
- VLM-R1 (R1-style Visual RL, 2025) pioneered applying GRPO with rule-based rewards to visual tasks, enabling 3B models to surpass 7B baselines
- AdaMLLM (Domain-Adaptive, 2024) established a systematic open-source domain adaptation pipeline outperforming GPT-4-based medical VLMs
Research shifted from proving VLMs could do tasks to rigorously evaluating where they fail in the real world, spawning a wave of specialized benchmarks and domain-adaptive methods.
- RL-100 (Real-World, 2025) achieved 100% success rate across 1000 evaluations including continuous 7-hour zero-failure operation in a public shopping mall
- Meissa (Medical Agentic Intelligence, 2026) demonstrated a 4B-parameter agent matching GPT-4o across medical benchmarks at 22x lower latency
- 4KAgent (Agentic 4K Super-Resolution, 2025) introduced a perception-restoration agent pipeline, setting a new SOTA across 11 task categories including medical and satellite imaging
- MirageTVQA (VLM, 2025) exposed a 35% performance drop from visual noise and a severe English-first bias across 24 languages
- Generative Engine Optimization (Pinterest GEO, 2026) deployed VLM agents at production scale, achieving 20% traffic growth across billions of images at 94x lower cost
The field transitioned from single-model solutions to multi-agent orchestration and deployment-focused systems, with lightweight models matching frontier model capabilities at a fraction of the cost.
Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Vision-Language-Action Reinforcement Learning | Unify imitation and reinforcement learning under VLA architectures, using world models or dense vision-language critics to enable safe, sample-efficient real-world robot learning. | Improves on Diffusion Policy (DP3) baseline by +32.2% mean success rate (100% vs 67.8%) across 8 real-world manipulation tasks, achieving continuous 7-hour zero-failure operation (RL-100). | RL-100 (2025), Dream to Fly (2025), A Vision-Language-Action-Critic Model for Robotic... (2025), WMPO (2025) |
| Domain-Adaptive VLM Post-Training | Generate domain-specific visual instruction data using open-source pipelines, then fine-tune VLMs with progressive curricula spanning captioning, VQA, and reinforcement learning. | Improves on LLaVA-Med (GPT-4 generated) by +4.6% on VQA-RAD using only open-source models (AdaMLLM); MatterChat outperforms GPT-4o on formation energy estimation for novel materials. | On Domain-Adaptive Post-Training for Multimodal... (2024), AgriGPT-VL (2025), MatterChat (2025), MM-Telco (2025) |
| R1-Style RL for Visual Reasoning | Use tasks with deterministic answers (bounding boxes, exact matches) as rule-based rewards in GRPO, enabling stable RL training that improves VLM reasoning and out-of-domain generalization. | Improves on Supervised Fine-Tuning by +8.34 points on LISA-Grounding (63.16 vs 54.82) with 3B model surpassing 7B baseline on OVDEval (31.01 vs 29.08); RARL achieves +27% on unseen medical datasets. | VLM-R1 (2025), RARL (2025), UAV-VL-R1 (2025), Are Video Reasoning Models Ready... (2026) |
| Agentic Multi-Step Visual Processing | Decompose complex visual tasks into perception, planning, and execution stages with reflection, rollback, and quality-driven expert routing for robust, interpretable processing. | Meissa (4B parameters) matches GPT-4o in 10/16 medical settings with 25x fewer parameters and 22x lower latency; 4KAgent sets new state-of-the-art on RealSR benchmarks across 11 task categories. | Meissa (2026), 4KAgent: Agentic Any Image to... (2025), Generative Engine Optimization (2026), GTA (2024) |
| Scale-then-Compress Efficient VLM Architecture | Increase input resolution and frame count for maximum information capture, then apply spatial-to-channel reshaping, token pruning, or model distillation to compress representations for efficient processing. | Improves on VILA baseline by +30% accuracy on text-heavy benchmarks while reducing training costs by 1.9-5.1x and prefilling latency by 1.6-2.2x (NVILA); LLaVA-Phi (3B) outperforms 7B+ models on ScienceQA with 71.4% accuracy. | NVILA (2024), LLaVA-Phi (2024), SWIFT (2024) |
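
The R1-style row above combines two ideas: a reward computed by a deterministic rule, and advantages computed relative to a group of sampled outputs. The following is a minimal sketch of that combination, not the VLM-R1 implementation. The IoU threshold of 0.5, the binary reward, and all function names are assumptions chosen for illustration; the normalization follows the commonly described GRPO form (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(pred_box, gt_box, threshold=0.5):
    """Assumed rule: reward 1.0 if the predicted box overlaps enough, else 0.0."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled box predictions (hypothetical values).
gt = (10, 10, 50, 50)
group = [(12, 11, 49, 52), (30, 30, 90, 90), (10, 10, 50, 50), (0, 0, 20, 20)]
rewards = [rule_based_reward(b, gt) for b in group]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
```

Because the advantage is computed within the group, no learned value model is needed; samples that hit the target receive positive advantages and the misses receive negative ones.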
Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MME-RealWorld | Accuracy (%) | <60% (GPT-4o / Gemini 1.5 Pro) | MME-RealWorld (2025) |
| MirageTVQA (Noisy Multilingual Tables) | Exact Match (%) | 25.52% EM clean / 16.50% EM noisy (Qwen2.5-VL-72B) | Lost in Translation and Noise:... (2025) |
| Real-World Robotic Manipulation (RL-100) | Success Rate (%) | 100% success rate across all tasks | RL-100 (2025) |
| LISA-Grounding (Out-of-Domain Visual Grounding) | Grounding Score | 63.16 (Qwen2.5-VL-3B + VLM-R1) | VLM-R1 (2025) |
Known Limitations (4)
- Domain gap persistence: even after adaptation, VLMs hallucinate domain-specific details (e.g., medical findings, rare species) because pre-trained visual encoders lack fine-grained domain features like microaneurysms or subtle crop diseases (affects: Domain-Adaptive VLM Post-Training, R1-Style RL for Visual Reasoning)
  Potential fix: Dual-stream encoding with specialized domain encoders fused via learned gates, as demonstrated by Deep Expert Injection achieving +12.55% precision improvement over simple addition
- Robustness to real-world degradation: models trained on clean data suffer catastrophic performance drops (35%+) when facing noisy, low-light, blurred, or compressed inputs typical of deployment environments (affects: Scale-then-Compress Efficient VLM Architecture, Domain-Adaptive VLM Post-Training)
  Potential fix: ROVA-style robustness training with structured spatio-temporal corruptions and consistency rewards between clean and perturbed branches, boosting perturbed accuracy by 24%+
- Evaluation-deployment mismatch: benchmarks using clean data, multiple-choice formats, and English-only content overestimate real-world capabilities, especially for safety-critical domains like autonomous driving and medicine (affects: Vision-Language-Action Reinforcement Learning, Agentic Multi-Step Visual Processing)
  Potential fix: Hierarchical fine-grained benchmarks (like VLADBench with 29 tertiary tasks) combined with closed-loop real-world testing and deployment-time monitoring frameworks that detect distribution shift
- Sample efficiency and safety in real-world RL: real-robot interactions are expensive and risky, and learned world models may not faithfully capture edge cases in unstructured environments (affects: Vision-Language-Action Reinforcement Learning)
  Potential fix: World model-based imagination (WMPO) for safe off-robot policy learning, combined with runtime monitoring hierarchies and feasibility-aware task planning that maximizes joint success probability
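
The first potential fix contrasts gated fusion of a general and a domain encoder with simple addition. The sketch below illustrates that contrast only; it is not the Deep Expert Injection architecture. The per-dimension sigmoid gate, the scalar gate weights, and every name here are assumptions, and a real system would learn the gate parameters and operate on tensors rather than Python lists.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def additive_fusion(general_feat, domain_feat):
    """Baseline: element-wise sum of the two feature vectors."""
    return [g + d for g, d in zip(general_feat, domain_feat)]


def gated_fusion(general_feat, domain_feat, gate_weights, gate_bias):
    """Blend two same-length feature vectors with a per-dimension gate.

    gate_weights holds one (w_general, w_domain) pair per dimension and
    gate_bias one bias per dimension; in practice these would be learned.
    """
    fused = []
    for g, d, (w_g, w_d), b in zip(general_feat, domain_feat, gate_weights, gate_bias):
        gate = sigmoid(w_g * g + w_d * d + b)      # in (0, 1)
        fused.append(gate * d + (1.0 - gate) * g)  # convex mix of the two streams
    return fused


general = [0.2, -1.0, 0.5]
domain = [0.9, 0.1, 0.5]

# With zero weights and bias the gate is 0.5, so the output is the average.
neutral = gated_fusion(general, domain, [(0.0, 0.0)] * 3, [0.0] * 3)

# A large positive bias opens the gate and passes the domain stream through.
domain_heavy = gated_fusion(general, domain, [(0.0, 0.0)] * 3, [50.0] * 3)
```

Unlike addition, the gated output stays within the range spanned by the two inputs, and the gate can suppress the domain stream on inputs where it is uninformative.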
Major papers in this topic (10)
- RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning (2025-10) 9
- Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight (2025-01) 9
- NVILA: Efficient Visual Language Models from Pre-training to Deployment (2024-12) 9
- Meissa: Multi-modal Medical Agentic Intelligence (2026-03) 9
- MME-RealWorld: A Benchmark for MLLM in the Real World (2025-02) 9
- Lost in Translation and Noise: A Deep Dive into Failure Modes of VLMs on Real-World Tables (2025-11) 9
- 4KAgent: Agentic Any Image to 4K Super-Resolution (2025-07) 9
- Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth (2026-02) 9
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023-09) 9
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (2025-04) 8
Another cross-cutting theme is Survey, which collects surveys, tutorials, and comprehensive benchmarks.
Survey
- MM-LLMs: Recent Advances in MultiModal Large Language Models (2024-01) 9
- Tutorial on Diffusion Models for Imaging and Vision (2024-03) 9
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024-05) 9
- Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey (2024-07) 9
- Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety (2025-02) 9
- Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers (2025-06) 9
- Explain Before You Answer: A Survey on Compositional Visual Reasoning (2025-08) 9
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents (2025-08) 9
- CRAG-MM: A Comprehensive Benchmark for Multi-modal Multi-turn Retrieval-Augmented Generation (2025-10) 9
- MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning (2026-01) 9
Practical Recommendations
| Priority | Recommendation | Evidence |
|---|---|---|
| High | Adopt reinforcement learning with verifiable rewards (GRPO) as the default post-training paradigm for multimodal models, as it consistently outperforms supervised fine-tuning by 10-30% across understanding, generation, and robotics tasks with minimal human annotation | R1-Zero showed a 2B GRPO-trained model outperforms SFT-tuned 72B models; SimpleVLA-RL achieves 91.7% from a single demo; Flow-GRPO boosts GenEval from 63% to 95% |
| High | Integrate grounded chain-of-thought reasoning into multimodal systems by requiring models to output bounding box coordinates alongside text reasoning, which reduces hallucination rates by 30-55% and improves answer-grounding consistency | GCoT revealed that even 72B models achieve only 11.1% grounding consistency despite 75.7% accuracy; grounded CoT improves consistency by +55.7% |
| High | Use dynamic visual token compression (pruning 50-76% of tokens) combined with resolution routing to achieve 4-10x inference speedup for production multimodal deployments with minimal accuracy loss | METEOR prunes 76% of tokens with only 0.3% accuracy drop; InternVL3.5 achieves 4.05x speedup scoring 77.7 on MMMU; AIM reduces FLOPs by 6.8x |
| High | Deploy compact sub-1B specialized models for document parsing and OCR tasks, as they match or exceed 100x-larger general-purpose VLMs on structured document understanding | GLM-OCR (0.9B) ranked first on OmniDocBench v1.5 outperforming GPT-5.2; olmOCR processes PDFs at 35x lower cost than GPT-4o |
| Medium | Use process reward models for step-level supervision during multimodal reasoning, as they improve performance across model scales by +5.9 points and enable effective test-time compute scaling via best-of-N selection | VisualPRM-8B improves even 78B models by +5.9 points; DreamPRM reaches 85.2% on MathVista; Athena-PRM achieves 83.1 F1 at 1/45th GPU cost |
| Medium | Adopt dual-system hierarchical architectures (fast + slow) for embodied AI applications requiring both high-level reasoning and real-time control, achieving 100+ Hz reactive control alongside VLM-quality planning | Fast-in-Slow achieves 117.7 Hz control and +11% over OpenVLA; OneTwoVLA achieves +30% on long-horizon tasks with autonomous mode switching |
| Medium | Evaluate multimodal models on process-level reasoning quality rather than just final answer accuracy, since correct answers frequently mask severe intermediate hallucinations: top models show 70.6% accuracy but only 22.8% thinking correctness | MM-THEBench revealed 70.6% answer accuracy but only 22.8% thinking correctness; GCoT found inverse scaling where larger models ground worse |
| Low | Leverage world models for safe robotic policy training in imagination before real-world deployment, reducing data requirements by 3-4x while enabling evaluation of candidate trajectories against predicted collisions | DINO-WM enables zero-shot planning with +45% success rate; PlayWorld improves real-world policy success by 65%; Kinematics-aware models achieve +23.1% mean return |
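
The token-compression recommendation in the table above can be pictured as a top-k selection over per-token importance scores. This is a minimal sketch under stated assumptions, not the METEOR, InternVL3.5, or AIM method: the source of the importance scores (for example, attention received from text tokens) is assumed, and `prune_visual_tokens` and `keep_ratio` are hypothetical names. A `keep_ratio` of 0.24 corresponds to pruning 76% of tokens.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.24):
    """Keep the highest-scoring fraction of tokens, preserving original order.

    tokens: list of visual token embeddings (any objects).
    scores: one importance score per token (assumed to be given).
    Returns the kept tokens and their original indices.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept_idx = sorted(top)  # restore spatial/temporal order for the LLM
    return [tokens[i] for i in kept_idx], kept_idx


# 100 placeholder tokens with distinct synthetic scores 0..99.
tokens = [f"tok{i}" for i in range(100)]
scores = [(i * 37) % 100 for i in range(100)]
kept_tokens, kept_idx = prune_visual_tokens(tokens, scores, keep_ratio=0.24)
print(len(kept_tokens))  # 24
```

Keeping the survivors in their original order matters because positional information is usually tied to token order; only the selection step uses the scores.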
Key Takeaways
RL Replaces Supervised Learning
Group Relative Policy Optimization (GRPO) has become the dominant post-training paradigm in the surveyed work, spanning visual reasoning, image generation, video understanding, and robotic control. Small 2-7B models trained with GRPO repeatedly outperform 72B+ supervised counterparts in the cited results, suggesting that RL teaches more transferable reasoning than SFT, which tends to memorize patterns.
GRPO-trained small models outperform 10x-larger supervised ones across the cited benchmarks.
Perception Is the Real Bottleneck
Across visual reasoning, video understanding, and robotics, 72-78% of errors stem from incorrect visual perception rather than flawed logic. Larger models paradoxically ground worse: 72B models show only 11.1% answer-grounding consistency despite 75.7% accuracy. This reveals that scaling alone cannot solve visual understanding.
Visual perception, not reasoning logic, causes most multimodal failures.
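
To see how high answer accuracy can coexist with low grounding consistency, it helps to compute the two numbers side by side. The sketch below assumes a simple definition that may differ from the one used in the cited work: a sample counts as consistent only if the answer is correct and the cited region is also correct. The field names and the toy data are hypothetical.

```python
def accuracy_and_consistency(samples):
    """Return (answer accuracy, answer-grounding consistency) as fractions.

    Each sample is a dict with two booleans:
      'answer_correct'    - the final answer matches the reference
      'grounding_correct' - the cited region matches the reference evidence
    Consistency here (an assumed definition) requires both to hold.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    accuracy = sum(s["answer_correct"] for s in samples) / n
    consistency = sum(
        s["answer_correct"] and s["grounding_correct"] for s in samples
    ) / n
    return accuracy, consistency


# Toy evaluation set: 6 of 8 answers are right, but only 1 is right for the
# right visual reason.
toy = (
    [{"answer_correct": True, "grounding_correct": True}]
    + [{"answer_correct": True, "grounding_correct": False}] * 5
    + [{"answer_correct": False, "grounding_correct": False}] * 2
)
acc, cons = accuracy_and_consistency(toy)
print(acc, cons)  # 0.75 0.125
```

Reporting both numbers exposes answers that are correct for reasons unrelated to the image evidence, which a single accuracy figure hides.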
Grounded Reasoning Halves Hallucinations
Requiring models to produce bounding box coordinates and spatial evidence alongside text reasoning reduces hallucination rates by 30-55% and exposes the gap between correct answers and correct reasoning. Training-free attention interventions like OPERA can achieve +35.8% improvement without any model retraining.
Spatial grounding in reasoning chains cuts hallucination rates by 30-55%.
Robots Achieve Near-Perfect Manipulation
RL post-training has broken the imitation learning ceiling for robotic manipulation, with systems achieving 99-100% success rates on real-world tasks and operating continuously for 7 hours in public environments. Dense process reward models enable one-shot adaptation from near-zero to 95% success with only 150 rollouts.
RL-enhanced robots achieve 100% real-world success rates continuously.
Compact Models Beat Giants
Across document parsing, GUI grounding, image generation, and robotic control, specialized compact models consistently outperform models 10-100x their size. A 0.9B document parser beats GPT-5.2, a 7B GUI agent outperforms 72B UI-TARS, and a 2.6B one-step generator surpasses 12B FLUX-dev.
Specialized compact models often outperform generalists 10-100x their size.
Generation Meets Understanding
The boundary between understanding and generation is dissolving. Unified models like MMaDA and Mogao jointly reason and generate, while chain-of-thought before generation improves compositional accuracy by 89-160%. Reasoning before acting has become the dominant paradigm from image creation to robotic control.
Reasoning before generating boosts quality 89-160% across modalities.
Emerging Trends
Test-time compute scaling allows small models to match much larger ones by spending additional computation at inference through evolutionary search, tree-based exploration, or process reward model-guided best-of-N selection
EvoSearch with Wan 1.3B matches the 10x larger Wan 14B model; VisVM-guided captions are preferred 74% over greedy decoding; DreamPRM reaches 85.2% on MathVista via best-of-N with o4-mini
Self-improving agentic systems that autonomously collect data, refine their own outputs, and learn from experience without human intervention are emerging across generation, editing, and robotics
SIDiffAgent uses Theory-of-Mind inspired self-improvement with +8.73% on GenAIBench; PlayWorld learns from autonomous robot play improving success by 65%; PLD's residual RL achieves self-improvement to 99% success
Unified understanding-generation models that jointly perceive and create across modalities are replacing separate specialized systems, with explicit reasoning bridging the cognitive gap between comprehension and synthesis
MMaDA surpasses autoregressive LLMs on reasoning while excelling at generation; ImageGen-CoT improves compositional accuracy by 89-160% via structured reasoning before generation; Mogao achieves 83.3% MME while enabling interleaved multi-modal generation
Physical world simulation via learned world models is enabling robots and autonomous vehicles to train entirely in imagination before real-world deployment, with explicit kinematics grounding and causal reasoning
DINO-WM enables zero-shot planning with frozen foundation model features; Kinematics-aware latent models reduce data needs by 4x; IRL-VLA eliminates sensor simulation via reward world models
Multi-scene narrative video generation and audio-visual joint synthesis are extending video generation from short single clips to minutes-long coherent storytelling with synchronized audio
Long Context Tuning generates coherent 20-shot 3-minute videos; Seedance 1.5 Pro achieves native joint audio-visual generation; COMIC produces fully automated comedy videos via multi-agent collaboration
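
The first trend above mentions process reward model-guided best-of-N selection. The following is a minimal sketch of that selection step only. The step scorer is a stand-in lookup table rather than a real process reward model, aggregating step scores with `min` is one assumed choice among several, and all names are hypothetical.

```python
def score_chain(step_scores, aggregate="min"):
    """Collapse per-step scores into one chain score."""
    if not step_scores:
        raise ValueError("a reasoning chain needs at least one step")
    if aggregate == "min":   # a chain is only as good as its weakest step
        return min(step_scores)
    if aggregate == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregate: {aggregate}")


def best_of_n(candidates, step_scorer, aggregate="min"):
    """Pick the answer whose reasoning chain scores highest.

    candidates: list of (answer, [step, ...]) pairs sampled from a model.
    step_scorer: callable mapping one step to a score in [0, 1].
    """
    best = max(
        candidates,
        key=lambda c: score_chain([step_scorer(s) for s in c[1]], aggregate),
    )
    return best[0]


# Stand-in for a process reward model: a fixed lookup table (hypothetical).
toy_scores = {
    "read both bar heights": 0.9,
    "add heights": 0.8,
    "multiply heights": 0.2,
    "misread one bar": 0.1,
}
candidates = [
    ("A", ["read both bar heights", "add heights"]),
    ("B", ["read both bar heights", "multiply heights"]),
    ("C", ["misread one bar", "add heights"]),
]
print(best_of_n(candidates, toy_scores.get))  # A
```

Extra inference compute enters through the number of candidates sampled; the selection itself is cheap once the step scores are available.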
Research Opportunities
Bridging the massive human-AI gap on abstract visual logic: top models achieve only 31.1% on VisuLogic versus 51.4% for humans, and near-random on tasks requiring genuine spatial reasoning beyond language shortcuts
Despite the RL revolution, fundamental visual perception and abstract reasoning remain dramatically weaker than human cognition. This gap limits deployment in safety-critical applications requiring reliable visual understanding.
Difficulty: High; Impact: High
Developing robust multilingual and culturally-aware multimodal models: current systems show up to 30+ percentage point performance drops on non-Western cultural concepts and low-resource languages
The field is overwhelmingly English-centric with Western cultural bias embedded in training data, evaluation benchmarks, and model design, severely limiting global applicability.
Difficulty: Medium; Impact: High
Solving physical plausibility in video generation: models produce visually stunning but physically impossible videos that violate gravity, object permanence, and fluid dynamics, limiting use in simulation and robotics
High visual fidelity scores mask fundamental physics violations. As video generation moves into world modeling for autonomous driving and robotics, physical accuracy becomes safety-critical.
Difficulty: High; Impact: High
Creating unified cross-domain evaluation standards that test process-level reasoning quality rather than just final answer accuracy, exposing models that achieve correct answers through hallucinated reasoning
Current benchmarks are fragmented across incompatible metrics, and models achieving 70.6% accuracy show only 22.8% thinking correctness. Standard MCQ formats allow text-based elimination without genuine visual understanding.
Difficulty: Medium; Impact: High
Scaling RL training efficiency for visual models: full trajectory sampling with large group sizes makes GRPO prohibitively expensive, limiting its accessibility to well-resourced labs
Despite GRPO's effectiveness, the computational cost of generating multiple candidate outputs per prompt creates significant barriers. Single-rollout and tree-structured approaches show promise but need further development.
Difficulty: Medium; Impact: Medium
Long-horizon interleaved multi-modal generation that maintains quality beyond 20 visual events: current unified models collapse after approximately 20 generated images regardless of text token count
The event bottleneck phenomenon limits practical applications like storybook generation, long-form document creation, and multi-turn visual dialogue to very short sequences.
Difficulty: High; Impact: Medium
Benchmark Leaderboard
MMMU (Expert-Level Multimodal Understanding)
Expert-level multimodal reasoning across 30 college subjects requiring domain knowledge and deliberate reasoning over diverse image types (Metric: Accuracy (%))
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 1 | InternVL3.5 with Visual Resolution Router and Cascade RL | 77.7%; +22.0% over GPT-4V (55.7%), still trailing human 88.6% | InternVL3.5 (2025) | 2025 |
| 2 | Kimi-VL with native-resolution MoE decoder | 64.0%; +8.3% over GPT-4V (55.7%) | Kimi-VL (2025) | 2025 |
MathVista (Visual Mathematical Reasoning)
Multimodal mathematical reasoning combining visual perception with problem-solving across geometry, statistics, and scientific figures (Metric: Accuracy (%))
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 1 | DreamPRM with o4-mini via best-of-N selection | 85.2%; +35.3% over original GPT-4V baseline (49.9%) | DreamPRM (2025) | 2025 |
| 2 | Kimi k1.5 with long-context RL | 74.9%; +25.0% over GPT-4V baseline | Kimi k1.5 (2025) | 2025 |
GenEval (Compositional Text-to-Image Generation)
Compositional accuracy in text-to-image generation across object counting, attribute binding, and spatial relationships (Metric: Overall Accuracy (%))
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 1 | DiffusionNFT with forward-process RL | 98%; +35 points absolute (+55.6% relative) over base SD3.5-M (63%) | DiffusionNFT (2025) | 2025 |
| 2 | Flow-GRPO with ODE-to-SDE conversion | 95%; +32 points over base SD3.5-M | Flow-GRPO (2025) | 2025 |
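
Improvements over a baseline can be quoted either in absolute percentage points or as a relative change, and the two differ substantially for the GenEval numbers above (63% for base SD3.5-M, 98% and 95% after RL). The short calculation below works through both conventions; the function name is illustrative.

```python
def gains(baseline, score):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = score - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, score in [("DiffusionNFT", 98.0), ("Flow-GRPO", 95.0)]:
    absolute, relative = gains(63.0, score)
    print(name, round(absolute, 1), round(relative, 1))
# DiffusionNFT 35.0 55.6
# Flow-GRPO 32.0 50.8
```

Read this way, a 55.6% figure for DiffusionNFT is a relative gain, while a 32-point figure for Flow-GRPO is an absolute one, so the two rows are only comparable after converting to the same convention.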
LIBERO (Multi-Task Robotic Manipulation)
Multi-task robotic manipulation generalization across objects, scenes, and long-horizon task sequences in simulation (Metric: Success Rate (%))
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 1 | Probe-Learn-Distill with residual RL agents | 99.0%; +22.5% over standard OpenVLA (76.5%) | Self-Improving (2025) | 2025 |
| 2 | VLA-Thinker with dynamic visual perception actions | 97.5%; +6.5% over OpenVLA-OFT (91.0%) | VLA-Thinker (2026) | 2026 |