MM-THEBench: Do Reasoning MLLMs Think Reasonably?

📝 Paper Summary

Multimodal Large Language Models (MLLMs); Reasoning and Chain-of-Thought (CoT) Evaluation; Hallucination detection

MM-THEBench evaluates the intermediate thinking process of reasoning MLLMs, revealing that models often produce correct final answers despite hallucinations in perception or reasoning steps.

Core Problem

Current benchmarks for reasoning MLLMs focus only on final answer correctness, ignoring the internal 'thinking' process. This masks cases where models get the right answer for the wrong reasons due to hallucinations in intermediate steps.

Why it matters:

Reasoning MLLMs (like OpenAI o1/o3) generate long Chains-of-Thought (CoT), but users cannot verify whether the reasoning is sound or the answer is only coincidentally correct
Existing hallucination benchmarks lack fine-grained taxonomies for intermediate steps, failing to distinguish between perceptual errors and logical failures during the thinking process
The free-form, unstructured nature of intermediate CoTs makes scalable automated evaluation difficult compared to simple final-answer checking

Concrete Example: A model might give the correct final answer to a visual question while its intermediate CoT claims to see objects that aren't there (perception hallucination) or uses flawed logic (reasoning hallucination). For instance, it might arrive at the correct answer to a physics problem while misinterpreting the forces shown in the diagram.

Key Novelty

MM-THEBench (Multimodal Thinking Hallucination Evaluation Benchmark)

Introduces a fine-grained, two-layer hallucination taxonomy for intermediate thoughts, categorizing errors into three cognitive dimensions: Knowledge, Perception, and Reasoning
Transforms existing datasets into a process-aware benchmark by annotating 1,340 questions with verified atomic reasoning steps and evaluation rubrics
Implements a multi-level automated evaluation framework using an LLM-as-a-judge to assess answer accuracy, step-level alignment, and rubric-based hallucination scoring

Architecture

The MM-THEBench evaluation framework, illustrating the pipeline from model output to multi-level evaluation.

Evaluation Highlights

Qwen3-VL-235B-A22B-Thinking achieves the highest final answer accuracy (70.62%) on image tasks, but its intermediate step precision is only 22.75%
Thinking correctness consistently lags behind final answer accuracy across 14 models; e.g., GPT-5 achieves high accuracy but scores lower on perception rubrics than on reasoning rubrics
Perception hallucinations are the most frequent error type but rarely cause wrong answers, whereas reasoning hallucinations are strongly correlated with incorrect final outcomes

Breakthrough Assessment

8/10

Significant contribution to the interpretability of reasoning MLLMs. It moves evaluation beyond final-answer accuracy to the validity of the reasoning process itself, addressing a critical gap in trustworthy AI.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Question Answering with evaluation of intermediate Chain-of-Thought (CoT) fidelity

Inputs: Multimodal input (image/video + text question)

Outputs: Intermediate reasoning steps (CoT) and final answer

Pipeline Flow

Step 1: Data Collection & Filtering (from existing benchmarks)
Step 2: Annotation (Step generation via Gemini-2.5-Pro → Human Verification → Rubric Generation)
Step 3: Model Inference (Extract explicit thinking or force CoT via prompting)
Step 4: Multi-level Evaluation (Answer checking → Step matching → Rubric scoring via Judge)
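
The multi-level evaluation in Step 4 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Sample` and `ModelOutput` types, the prompt strings, and the yes/no judge interface are all assumptions introduced here, and the judge is passed in as a plain callable so that any LLM client could be wrapped behind it.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge interface: prompt in, "yes"/"no" verdict out.
# In the described setup this role is played by an LLM judge.
Judge = Callable[[str], str]


@dataclass
class Sample:
    question: str
    reference_answer: str
    reference_steps: list[str]  # human-verified atomic steps
    rubrics: list[str]          # per-question evaluation criteria


@dataclass
class ModelOutput:
    thinking_steps: list[str]   # extracted thinking or prompted CoT
    final_answer: str


def ask(judge: Judge, prompt: str) -> bool:
    """Interpret the judge's reply as a yes/no verdict."""
    return judge(prompt).strip().lower().startswith("yes")


def fraction(flags: list[bool]) -> float:
    """Share of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def evaluate(sample: Sample, output: ModelOutput, judge: Judge) -> dict:
    thinking = "\n".join(output.thinking_steps)

    # Level 1: answer checking (illustrative prompt format).
    answer_ok = ask(
        judge,
        f"ANSWER\nreference: {sample.reference_answer}\n"
        f"prediction: {output.final_answer}",
    )

    # Level 2: step matching, one judge call per reference step.
    covered = [
        ask(judge, f"STEP\nreference step: {step}\nthinking: {thinking}")
        for step in sample.reference_steps
    ]

    # Level 3: rubric scoring, one judge call per criterion.
    satisfied = [
        ask(judge, f"RUBRIC\ncriterion: {rubric}\nthinking: {thinking}")
        for rubric in sample.rubrics
    ]

    return {
        "answer_correct": answer_ok,
        "step_recall": fraction(covered),
        "rubric_score": fraction(satisfied),
    }
```

Keeping the judge behind a callable makes the three levels easy to test with a deterministic stand-in before spending calls on a real model.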

System Modules

Annotator

Generate reference reasoning steps and evaluation rubrics

Model or implementation: Gemini-2.5-Pro (followed by human verification)

Target Model

Generate reasoning and answers for evaluation

Model or implementation: Various MLLMs (e.g., GPT-5, Qwen3-VL, Claude-3.5)

Judge Model

Assess correctness of answer and quality of reasoning

Model or implementation: Qwen-3-32B

Novel Architectural Elements

Two-layer hallucination taxonomy specifically for intermediate CoTs: Top layer (Knowledge, Perception, Reasoning) → Subcategories (e.g., OCR, Spatial, Deductive)
Atomic step-based rubric evaluation system designed to match free-form thinking processes against structured ground truth
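
One way to represent the two-layer taxonomy in code is an enum for the top layer plus a lookup of allowed subcategories. The sketch below is an assumption-laden illustration: the summary names OCR, Spatial, and Deductive only as examples, so their assignment to dimensions here is a guess (Spatial is placed under both Perception and Reasoning because spatial errors are reported in both), and the Knowledge entry is a placeholder since no Knowledge subcategory is listed.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """Top layer of the taxonomy, as named in the summary."""
    KNOWLEDGE = "knowledge"
    PERCEPTION = "perception"
    REASONING = "reasoning"


# Second layer. ASSUMED assignment of the example subcategories;
# "unspecified" is a placeholder, not a category from the paper.
TAXONOMY: dict[Dimension, frozenset[str]] = {
    Dimension.KNOWLEDGE: frozenset({"unspecified"}),
    Dimension.PERCEPTION: frozenset({"ocr", "spatial"}),
    Dimension.REASONING: frozenset({"deductive", "spatial"}),
}


@dataclass(frozen=True)
class HallucinationLabel:
    """A (dimension, subcategory) tag for one hallucinated step."""
    dimension: Dimension
    subcategory: str

    def __post_init__(self) -> None:
        # Reject subcategories that do not belong to the dimension.
        if self.subcategory not in TAXONOMY[self.dimension]:
            raise ValueError(
                f"{self.subcategory!r} is not a {self.dimension.value} subcategory"
            )
```

Validating at construction time keeps judge outputs that mix up the two layers from silently entering the aggregate statistics.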

Modeling

Base Model: Qwen-3-32B (as Judge)

📊 Experiments & Results

Evaluation Setup

Multimodal QA across 8 domains (Math, Charts, Video, Spatial, etc.) using 1,340 samples

Benchmarks:

MM-THEBench (Multimodal Reasoning and Hallucination Evaluation) [New]

Metrics:

Answer Accuracy (Acc)
Step-level Precision/Recall/F1
Rubric-level Scores (Knowledge, Perception, Reasoning)
H-score (Hallucination-free score)
Statistical methodology: Human meta-evaluation of the judge model on 300 instances to verify agreement
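
The step-level metrics can be illustrated with a short calculation. The summary does not spell out the exact definitions, so this sketch assumes the usual ones: precision is the share of model steps that match some reference step, recall is the share of reference steps covered by the model's thinking, and F1 is their harmonic mean. The function name and the boolean-list inputs are hypothetical.

```python
def step_prf(
    model_step_supported: list[bool],
    reference_step_covered: list[bool],
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from per-step match flags.

    model_step_supported[i]   -- model step i matches a reference step
    reference_step_covered[j] -- reference step j appears in the thinking
    """
    precision = (
        sum(model_step_supported) / len(model_step_supported)
        if model_step_supported else 0.0
    )
    recall = (
        sum(reference_step_covered) / len(reference_step_covered)
        if reference_step_covered else 0.0
    )
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy comparison (invented numbers): both traces cover 3 of 4 reference
# steps, but the verbose trace spreads them over 10 steps.
verbose = step_prf([True] * 3 + [False] * 7, [True, True, True, False])
concise = step_prf([True, True, True, False], [True, True, True, False])
print(verbose)  # precision 0.3, recall 0.75
print(concise)  # precision 0.75, recall 0.75
```

Under these assumed definitions, extra or redundant steps lower precision without affecting recall, which is consistent with the low precision reported for verbose thinking outputs.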

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Answer-level accuracy shows large-scale models leading, but small models remain competitive.
MM-THEBench (Image)	Accuracy	57.77	70.62	+12.85
MM-THEBench (Image)	Accuracy	69.57	70.62	+1.05
Step-level analysis reveals low precision in explicit thinking models compared to prompted CoT, suggesting verbosity or redundancy in 'thinking' outputs.
MM-THEBench (Image)	Precision	58.07	22.75	-35.32
Rubric-level analysis shows that although models perceive poorly (low Perception scores), they maintain high hallucination-free scores (H-score), suggesting that they omit details rather than hallucinate them.
MM-THEBench (Image)	Perception Score	53.60	64.84	+11.24
MM-THEBench (Image)	H-score (Hallucination-free)	90.26	95.55	+5.29

Experiment Figures

Stacked bar chart showing the distribution of hallucination types (Perception, Reasoning, Knowledge, Mixed) relative to answer correctness (Correct vs. Incorrect) for various models.

Main Takeaways

Correctness of intermediate CoTs lags significantly behind final answer accuracy; models often get the right answer with flawed reasoning.
Perception hallucinations are the most frequent type but correlate only weakly with incorrect answers; models often still reach the correct answer despite perceptual errors.
Reasoning and Mixed hallucinations are far more damaging, showing a strong association with incorrect final outcomes.
Spatial-related hallucinations dominate both perception and reasoning error categories.
Models evaluated via CoT prompting (like GPT-5) produce more concise and higher-precision steps than models with native 'Thinking' modes (like Qwen-Thinking), which tend to be verbose and redundant.
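
The distinction between how frequent a hallucination type is and how strongly it is tied to wrong answers can be made concrete with a small tabulation. The sketch below is illustrative only: the function name is hypothetical and the records are toy data invented for this example, not results from the paper.

```python
from collections import defaultdict


def incorrect_rate_by_type(records):
    """records: iterable of (hallucination_type, answer_correct) pairs.

    Returns {type: (count, share of those cases with a wrong answer)}.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for h_type, answer_correct in records:
        totals[h_type] += 1
        if not answer_correct:
            wrong[h_type] += 1
    return {t: (totals[t], wrong[t] / totals[t]) for t in totals}


# Toy records invented for illustration only; NOT data from the paper.
toy_records = [
    ("perception", True),
    ("perception", True),
    ("perception", False),
    ("reasoning", False),
    ("reasoning", False),
    ("mixed", False),
]

stats = incorrect_rate_by_type(toy_records)
for h_type, (count, rate) in sorted(stats.items()):
    print(f"{h_type}: n={count}, incorrect rate={rate:.2f}")
```

In this toy set, perception is the most frequent type yet has the lowest incorrect rate, which is the pattern the takeaways describe.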

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) reasoning
LLM-as-a-judge evaluation methodologies

Key Terms

Reasoning MLLMs: Multimodal models incentivized to produce long intermediate reasoning chains (CoTs) before generating final outputs

Intermediate CoT: The step-by-step reasoning text generated by the model before the final answer, used to explain the decision process

Hallucination: Content in generated text that is inconsistent with factual knowledge, multimodal evidence, or logical context

Rubric-based evaluation: An assessment method where an LLM judge scores model outputs against specific, pre-defined criteria (rubrics) for distinct capabilities

IoU: Intersection over Union—a metric used in grounding tasks to measure the overlap between a predicted bounding box and the ground truth box

Cognitive Dimensions: The three top-level categories in the paper's taxonomy: Knowledge (facts), Perception (visual/audio sensing), and Reasoning (logic)

H-score: Hallucination-free score, calculated as 1 minus the ratio of hallucinated content, quantifying the absence of errors
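
Two of the terms above are simple formulas and can be sketched directly. The H-score function assumes that "ratio of hallucinated content" means hallucinated units divided by total units (for example, steps); the summary does not state the unit, so that choice is an assumption. The IoU function uses the standard definition for axis-aligned boxes given as `(x1, y1, x2, y2)`, a coordinate convention chosen here for illustration.

```python
def h_score(n_hallucinated: int, n_total: int) -> float:
    """Hallucination-free score: 1 minus the hallucinated share.

    ASSUMPTION: content is counted in discrete units such as steps.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1.0 - n_hallucinated / n_total


def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(h_score(1, 20))                  # 1 hallucinated unit out of 20
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```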