M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

📝 Paper Summary

Visual Mathematical Reasoning, Multi-Agent Systems, Context Engineering

M3-ACE improves visual math reasoning by decoupling perception from reasoning and using a multi-agent framework to iteratively cross-validate and refine visual evidence extraction without model training.

Core Problem

Multimodal Large Language Models (MLLMs) frequently fail at visual math problems due to inaccurate visual perception (incorrect evidence extraction) rather than flawed reasoning logic.

Why it matters:

Models exhibit high reasoning trajectory accuracy (~90%) but low visual evidence accuracy (~60%), making perception the dominant bottleneck
Single-model self-correction fails due to confirmation bias; models remain overconfident in initial wrong perceptions even when prompted to reflect
Providing the correct final answer does not help models recover correct visual evidence, creating a one-way dependency where correct perception is a strict prerequisite

Concrete Example: In a geometry problem, a model might correctly plan to use the Pythagorean theorem (valid reasoning) but incorrectly perceive a triangle's side length as 3 instead of 4 (perception error). Because the reasoning trace is logically sound based on the wrong input, the model's self-reflection confirms the error rather than fixing it.
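To make this failure mode concrete, the sketch below applies the same valid Pythagorean step to a correctly and an incorrectly perceived leg. The second leg length of 3 and the function name are assumptions for illustration; the text only specifies the 3-versus-4 misreading.

```python
import math


def hypotenuse(leg_a: float, leg_b: float) -> float:
    """Valid reasoning step: Pythagorean theorem for a right triangle."""
    return math.sqrt(leg_a ** 2 + leg_b ** 2)


# Hypothetical second leg; only the misread leg (4 read as 3) comes from the text.
OTHER_LEG = 3.0

correct_answer = hypotenuse(4.0, OTHER_LEG)  # leg perceived correctly
wrong_answer = hypotenuse(3.0, OTHER_LEG)    # leg misperceived as 3

print(correct_answer)
print(round(wrong_answer, 4))
```

Both calls run the same sound logic, so inspecting the reasoning trace alone cannot expose the error; only re-examining the perceived input can.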

Key Novelty

Multi-Agentic Context Engineering (M3-ACE)

Explicitly decouples the 'Visual Evidence List' (perception) from the final reasoning process, treating perception as a distinct upstream task
Uses multiple heterogeneous agents to generate diverse observation lists, breaking the confirmation bias inherent in single-model reflection
Employs lightweight 'Summary' and 'Refine' tools to cluster evidence into consistent/conflicting groups and filter out unreliable perceptual facts before reasoning begins
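The grouping into consistent and conflicting evidence can be sketched roughly as follows. The paper's Summary tool is described as LLM-based; this sketch swaps in exact string matching on normalized values as a simplification, and the agent names, fact keys, and the `summarize` function are all hypothetical.

```python
from collections import defaultdict


def summarize(agent_lists: dict):
    """Split facts into consistent vs. conflicting groups across agents."""
    values = defaultdict(dict)  # fact key -> {agent name: reported value}
    for agent, facts in agent_lists.items():
        for key, value in facts.items():
            values[key][agent] = value.strip().lower()

    consistent, conflicting = {}, {}
    for key, by_agent in values.items():
        distinct = set(by_agent.values())
        if len(distinct) == 1:
            # Every agent that reported this fact gave the same value.
            consistent[key] = distinct.pop()
        else:
            # Keep the per-agent values so a later step can resolve them.
            conflicting[key] = dict(by_agent)
    return consistent, conflicting


# Hypothetical observation lists from three agents.
observations = {
    "agent_a": {"side AB": "4", "angle C": "90 degrees"},
    "agent_b": {"side AB": "3", "angle C": "90 degrees"},
    "agent_c": {"side AB": "4", "angle C": "90 degrees"},
}
consistent, conflicting = summarize(observations)
print(consistent)
print(conflicting)
```

Note that in this simplified version a fact reported by only one agent counts as consistent; a real tool would likely need to treat unconfirmed facts separately.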

Architecture

Figure: Conceptual workflow of the M3-ACE framework.

Evaluation Highlights

Achieves 89.1% accuracy on the MathVision benchmark, establishing a new state-of-the-art
Demonstrates that correcting Visual Evidence (VE) alone raises answer accuracy to ~88.5%, whereas models only achieve ~12% VE accuracy on initially incorrect answers
Outperforms Qwen3.5 (78.9%) and GPT-4o (30.39%) baselines on MathVision competition-level problems

Breakthrough Assessment

8/10

Identifies the precise bottleneck (perception vs. reasoning) with strong empirical backing and provides a training-free, agentic solution that yields significant gains on difficult benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering (VQA) in the domain of Mathematics

Inputs: Image I and Problem Text Q

Outputs: Final Answer A

Pipeline Flow

Perception Phase: Multiple Agents → Shared Context → Summary Tool → Refine Tool
Reasoning Phase: Solver Agent (uses Refined VE)
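A minimal sketch of this two-phase flow is shown below, assuming each stage can be treated as a plain callable. All function names and the toy agents are hypothetical stand-ins (real stages would call MLLMs), and majority voting replaces the LLM-based Refine tool purely for illustration.

```python
from collections import Counter


def run_pipeline(image, question, perceivers, summarize, refine, solve, max_rounds=3):
    """Perception loops until the agents agree; only then does reasoning run."""
    refined = {}
    for _ in range(max_rounds):
        # Each agent sees the image, the question, and the current refined VE.
        ve_lists = [perceive(image, question, refined) for perceive in perceivers]
        consistent, conflicting = summarize(ve_lists)
        refined = refine(consistent, conflicting)
        if not conflicting:  # perception has converged
            break
    return solve(question, refined)


# --- toy stand-ins (hypothetical) ---
def steady_agent(image, question, refined):
    return {"leg_a": 4, "leg_b": 3}


def biased_agent(image, question, refined):
    # Misreads leg_a at first, then defers to the refined context if present.
    return {"leg_a": refined.get("leg_a", 3), "leg_b": 3}


def summarize(ve_lists):
    consistent, conflicting = {}, {}
    for key in set().union(*ve_lists):
        votes = [ve[key] for ve in ve_lists if key in ve]
        (consistent if len(set(votes)) == 1 else conflicting)[key] = votes
    return consistent, conflicting


def refine(consistent, conflicting):
    refined = {key: votes[0] for key, votes in consistent.items()}
    # Majority vote: a simplification of the LLM-based Refine tool.
    refined.update({k: Counter(v).most_common(1)[0][0] for k, v in conflicting.items()})
    return refined


def solve(question, refined):
    return (refined["leg_a"] ** 2 + refined["leg_b"] ** 2) ** 0.5


answer = run_pipeline(
    "triangle.png", "Find the hypotenuse.",
    perceivers=[steady_agent, steady_agent, biased_agent],
    summarize=summarize, refine=refine, solve=solve,
)
print(answer)
```

In this toy run the first round surfaces a conflict on `leg_a`, the second round converges, and only then is the solver invoked.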

System Modules

Assistant Agents (Perception Phase)

Generate initial lists of Visual Evidence (VE) from the image

Model or implementation: Gemini-2.5 Pro / GPT-5 (implied base models)

Summary Tool (Perception Phase)

Organize evidence from different agents into categories

Model or implementation: LLM-based Tool

Refine Tool (Perception Phase)

Filter unreliable samples and guide iterative correction of the evidence list

Model or implementation: LLM-based Tool

Solver Agent

Execute mathematical reasoning using the verified visual evidence

Model or implementation: Gemini-2.5 Pro / GPT-5 (implied base models)

Novel Architectural Elements

Shared Visual Evidence Context: A dynamic memory structure centered on VE lists rather than conversational history
Perception-Reasoning Decoupling: Explicit architectural separation where reasoning only begins after VE convergence
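One way to picture a memory centered on VE lists rather than dialogue turns is the sketch below. The class name, its methods, and the convergence test (all agents holding identical lists) are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SharedVEContext:
    """Hypothetical memory keyed by agent, holding VE lists instead of chat turns."""

    ve_lists: dict = field(default_factory=dict)  # agent name -> {fact: value}

    def post(self, agent: str, evidence: dict) -> None:
        # A new list replaces the agent's previous one; no dialogue history is kept.
        self.ve_lists[agent] = dict(evidence)

    def converged(self) -> bool:
        # Converged when every agent currently holds the same VE list.
        lists = list(self.ve_lists.values())
        return bool(lists) and all(ve == lists[0] for ve in lists)


ctx = SharedVEContext()
ctx.post("agent_a", {"side AB": "4"})
ctx.post("agent_b", {"side AB": "3"})
before = ctx.converged()  # agents disagree, so reasoning must wait

ctx.post("agent_b", {"side AB": "4"})
after = ctx.converged()   # lists now match, so the solver may start
print(before, after)
```

Gating the solver on `converged()` is one simple way to enforce the rule that reasoning begins only after the evidence has stabilized.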

Modeling

Base Model: Gemini-2.5 Pro and GPT-5 (used as backbones for agents)

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoT: CoT entangles perception and reasoning, leading to hallucinated evidence; M3-ACE decouples them.
vs. Reflexion: Single-agent Reflexion suffers from confirmation bias (the model cannot see its own perception errors); M3-ACE uses multi-agent disagreement to force genuine re-verification.
vs. MathVista baselines: Prior methods use fine-tuning or RL; M3-ACE uses inference-time context engineering.

Limitations

Relies on the capabilities of proprietary foundation models (Gemini-2.5 Pro, GPT-5).
Inference cost is higher due to multi-agent communication and multiple API calls per problem.
Effectiveness depends on the existence of visual evidence; less effective if the problem is purely textual or requires no visual parsing.

Reproducibility

Code availability is not stated. The method relies on prompt/context engineering of closed-source APIs (Gemini/GPT), so exact reproduction requires the specific prompts, which are described conceptually but not linked as a file.

📊 Experiments & Results

Evaluation Setup

Multimodal mathematical reasoning across varying difficulty levels (foundation to competition level)

Benchmarks:

MathVision: Competition-level visual math problems (AMC, AIME)
MathVista: General visual math reasoning
MathVerse: Visual ablation benchmark for diagnosing modality usage

Metrics:

Accuracy (%)
Visual Evidence (VE) Accuracy (%)
Trajectory Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MathVision	Accuracy	78.9	89.1	+10.2

Diagnostic experiments reveal the hierarchy of failure modes: Reasoning (Trajectory) is strong, Perception (VE) is weak.

Benchmark	Metric	Baseline	This Paper	Δ
Mini Benchmark (Internal)	Trajectory Accuracy	N/A	90.0	N/A
Mini Benchmark (Internal)	Visual Evidence (VE) Accuracy	N/A	60.0	N/A
Mini Benchmark (Error Subset)	VE Accuracy (Incorrect Answers)	N/A	12.0	N/A

Self-correction experiments show that single models cannot fix perception errors, even with strong hints.

Benchmark	Metric	Baseline	This Paper	Δ
Mini Benchmark (Error Subset)	Accuracy (2nd Round)	N/A	30.0	N/A
Mini Benchmark (Error Subset)	Accuracy (2nd Round, with corrected VE)	30.0	88.5	+58.5
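As a quick sanity check, the two reported deltas can be recomputed from the baseline and result columns. The row labels below are paraphrased for brevity; the numbers are the ones reported above.

```python
# (label, baseline %, result %, reported delta in percentage points)
reported = [
    ("MathVision accuracy", 78.9, 89.1, 10.2),
    ("Error subset, 2nd-round accuracy", 30.0, 88.5, 58.5),
]

for name, baseline, result, delta in reported:
    # Round to one decimal to match the precision of the reported values.
    computed = round(result - baseline, 1)
    print(f"{name}: computed {computed:+.1f}, reported {delta:+.1f}")

deltas_match = all(round(r - b, 1) == d for _, b, r, d in reported)
```

Both differences agree with the Δ column at the reported one-decimal precision.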

Experiment Figures

Diagnostic breakdown of failure modes and self-correction capabilities

Main Takeaways

Visual Evidence (VE) extraction is the primary bottleneck in visual math reasoning; reasoning trajectories are usually correct even when the final answer is wrong.
A strong asymmetry exists: Correct VE leads to correct answers, but providing the correct answer does not help the model recover correct VE (it cannot 'see' backwards).
Single-model self-reflection is ineffective for perception errors due to confirmation bias; explicit multi-agent disagreement is required to trigger genuine re-perception.
Context engineering (M3-ACE) can achieve state-of-the-art results without fine-tuning by restructuring the inference process to prioritize perceptual verification.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) / Multimodal LLMs
Chain-of-Thought (CoT) Prompting
Agentic workflows (multi-agent collaboration)

Key Terms

Visual Evidence (VE): Structured perceptual facts extracted from an image (e.g., 'side AB = 5', 'angle C is 90 degrees') that serve as the factual premises for reasoning

Context Engineering: Optimizing model performance by structuring the prompt context and interaction protocol rather than updating model weights (training)

Decoupling Principle: Separating the task of extracting visual facts from the task of logical reasoning to prevent reasoning hallucinations from corrupting perception

MathVision: A challenging benchmark for visual mathematical reasoning sourced from math competitions like AMC and AIME

Modality Laziness: The tendency of multimodal models to rely on text redundancy while ignoring or underutilizing visual information

SOTA: State-of-the-art; the current best performance achieved by any method