Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

📝 Paper Summary

Multi-agent Agentic RAG pipeline

VMAO coordinates specialized agents via a directed acyclic graph plan and an independent verification loop that detects missing information and triggers adaptive replanning before final synthesis.

Core Problem

Existing multi-agent frameworks lack principled quality verification at the orchestration level; they often fail to detect incomplete or failed sub-tasks before synthesizing a final answer.

Why it matters:

Complex domains like market research require aggregating scattered data (financial, operational, competitive); missing one aspect invalidates the whole analysis
Current systems rely on debate or role-play, which improve reasoning but do not explicitly verify task completeness against the original plan
Production systems need reliability without constant human oversight, requiring automated mechanisms to decide when to stop iterating

Concrete Example: In a query requiring financial, operational, and competitive data, a static pipeline might fail to retrieve financial metrics due to a tool error but proceed to synthesis anyway, resulting in a partial answer. VMAO's verifier detects the missing financial node and triggers a targeted retry.

Key Novelty

Verified Multi-Agent Orchestration (VMAO)

Verification-Driven Replanning: Uses an independent LLM to evaluate if agent outputs satisfy the DAG plan, acting as a coordination signal decoupled from agent implementation
DAG-Based Context Propagation: Decomposes queries into a dependency graph where upstream results are automatically passed to downstream agents, enabling efficient parallel execution
Hierarchical Synthesis: Handles large result sets by summarizing within agent groups before creating the final answer, ensuring source attribution is preserved

Architecture

The five-phase VMAO workflow: Plan → Execute → Verify → Replan → Synthesize

Evaluation Highlights

+1.1 point improvement in Answer Completeness (4.2 vs 3.1 on 1-5 scale) compared to Single-Agent baseline on market research queries
+1.5 point improvement in Source Quality (4.1 vs 2.6 on 1-5 scale), indicating significantly better citation and traceability
+53% improvement in completeness specifically for Strategic Assessment queries, which require synthesizing information across multiple dimensions

Breakthrough Assessment

7/10

Strong engineering framework addressing the critical reliability gap in multi-agent systems via explicit verification. While methodologically straightforward, the implementation and specific application to complex research demonstrate significant practical value.

⚙️ Technical Details

Problem Definition

Setting: Complex query resolution requiring multi-hop information gathering and synthesis from heterogeneous sources

Inputs: Complex natural language query (e.g., market research question)

Outputs: Synthesized final answer with source attribution

Pipeline Flow

Query Planner (Decomposes query into DAG)
DAG Executor (Runs agents in parallel batches)
Result Verifier (Checks completeness/contradictions)
Adaptive Replanner (Decides to retry, add questions, or merge)
Hierarchical Synthesizer (Produces final report)
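
The five stages above can be read as a single control loop. The following is a minimal sketch, not the paper's implementation: the `SubQuestion` type, the `run_vmao_loop` function, the callable signatures, and the iteration cap are all hypothetical names chosen for illustration, and the only replanning action modeled is a retry of the flagged sub-questions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    qid: str
    text: str
    deps: List[str] = field(default_factory=list)


def run_vmao_loop(
    query: str,
    plan: Callable[[str], List[SubQuestion]],
    execute: Callable[[List[SubQuestion], Dict[str, str]], Dict[str, str]],
    verify: Callable[[List[SubQuestion], Dict[str, str]], List[str]],
    synthesize: Callable[[str, Dict[str, str]], str],
    max_iterations: int = 3,  # assumed cap, not a value from the paper
) -> str:
    """Plan once, then execute/verify/replan until no gaps remain or the cap is hit."""
    questions = plan(query)
    results: Dict[str, str] = {}
    pending = questions
    for _ in range(max_iterations):
        results.update(execute(pending, results))
        gaps = verify(questions, results)  # ids of incomplete sub-questions
        if not gaps:
            break
        # Simplest replanning action: retry only the sub-questions flagged as gaps.
        pending = [q for q in questions if q.qid in gaps]
    return synthesize(query, results)


def demo() -> str:
    """Simulate a tool error on the first 'financial' attempt."""
    attempts = {"financial": 0}

    def plan(query: str) -> List[SubQuestion]:
        return [SubQuestion("financial", "..."), SubQuestion("operational", "...")]

    def execute(pending: List[SubQuestion], results: Dict[str, str]) -> Dict[str, str]:
        out: Dict[str, str] = {}
        for q in pending:
            if q.qid == "financial":
                attempts["financial"] += 1
                if attempts["financial"] == 1:
                    continue  # simulated tool failure: no result produced
            out[q.qid] = f"answer:{q.qid}"
        return out

    def verify(questions: List[SubQuestion], results: Dict[str, str]) -> List[str]:
        return [q.qid for q in questions if q.qid not in results]

    def synthesize(query: str, results: Dict[str, str]) -> str:
        return "; ".join(f"{k}={v}" for k, v in sorted(results.items()))

    return run_vmao_loop("market research query", plan, execute, verify, synthesize)
```

In `demo`, the first pass leaves the financial sub-question unanswered, the verifier flags it, and the second pass retries only that node before synthesis.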

System Modules

Query Planner

Decompose complex query into sub-questions with dependencies and agent assignments

Model or implementation: Claude Sonnet 4.5

DAG Executor

Execute sub-questions respecting dependencies; handles context propagation

Model or implementation: Claude Sonnet 4.5 (Primary), Claude Haiku 4.5 (Fallback)

Result Verifier

Evaluate result completeness against sub-question requirements

Model or implementation: Claude Opus 4.5

Adaptive Replanner

Determine corrective actions for gaps identified by verifier

Model or implementation: Claude Sonnet 4.5

Hierarchical Synthesizer

Aggregates results into final answer

Model or implementation: Claude Sonnet 4.5
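
One way to picture hierarchical synthesis is a two-level reduction: summarize within each agent group, attach that group's sources, then summarize the group summaries. The sketch below is an assumption-laden illustration; `AgentResult`, `hierarchical_synthesize`, and the `summarize` callable (standing in for an LLM call) are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentResult:
    agent_group: str
    text: str
    sources: List[str]


def hierarchical_synthesize(
    results: List[AgentResult],
    summarize: Callable[[List[str]], str],  # placeholder for an LLM summarizer
) -> str:
    """Summarize per agent group first, then combine the group summaries."""
    grouped: Dict[str, List[AgentResult]] = defaultdict(list)
    for result in results:
        grouped[result.agent_group].append(result)

    group_summaries: List[str] = []
    for group in sorted(grouped):
        items = grouped[group]
        summary = summarize([r.text for r in items])
        # Carry the de-duplicated sources alongside each group summary
        # so attribution survives the second summarization level.
        sources = sorted({s for r in items for s in r.sources})
        group_summaries.append(f"{summary} (sources: {', '.join(sources)})")

    return summarize(group_summaries)


example_results = [
    AgentResult("finance", "revenue up", ["10-K"]),
    AgentResult("finance", "margin flat", ["earnings call"]),
    AgentResult("competitive", "rival launched", ["press release"]),
]
report = hierarchical_synthesize(example_results, lambda texts: " | ".join(texts))
```

Here a plain string join stands in for the summarizer, which keeps the example deterministic; a real summarizer would need to be instructed to keep the source annotations intact.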

Novel Architectural Elements

Orchestration-level verification loop: Decouples the 'done' signal from the working agents by using a separate verifier to trigger replanning
Dependency-aware context propagation: Automatically prepends upstream results to downstream sub-questions within the DAG execution
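
Dependency-aware execution with context propagation can be sketched as follows. This is an illustrative sketch under stated assumptions: `parallel_batches`, `build_prompt`, `execute_dag`, and the bracketed context format are hypothetical, and the `agent` callable stands in for whatever specialized agent handles a sub-question.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parallel_batches(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group node ids into batches; each batch depends only on earlier batches."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    done: set = set()
    batches: List[List[str]] = []
    while remaining:
        ready = sorted(n for n, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency in plan")
        batches.append(ready)
        done.update(ready)
        for node in ready:
            del remaining[node]
    return batches


def build_prompt(question: str, parents: List[str], results: Dict[str, str]) -> str:
    """Prepend upstream results to a downstream sub-question."""
    context = [f"[{p}] {results[p]}" for p in parents]
    return "\n".join(context + [question])


def execute_dag(
    questions: Dict[str, str],
    deps: Dict[str, List[str]],
    agent: Callable[[str], str],
) -> Dict[str, str]:
    """Run each batch concurrently, feeding upstream answers into later prompts."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        for batch in parallel_batches(deps):
            prompts = [build_prompt(questions[n], deps[n], results) for n in batch]
            # Nodes within a batch are mutually independent, so they can run in parallel.
            for node, answer in zip(batch, pool.map(agent, prompts)):
                results[node] = answer
    return results


example_deps = {"revenue": [], "costs": [], "margin": ["revenue", "costs"]}
example_questions = {
    "revenue": "What is the revenue?",
    "costs": "What are the costs?",
    "margin": "What is the margin?",
}
```

With `example_deps`, the revenue and costs nodes form the first batch and the margin node runs afterwards with both upstream answers prepended to its prompt.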

Modeling

Base Model: Claude Sonnet 4.5 (Execution), Claude Opus 4.5 (Verification)

Comparison to Prior Work

vs. AutoGen: VMAO uses explicit DAG planning and verification rather than conversational turns
vs. MetaGPT: VMAO allows dynamic replanning based on result quality rather than static SOPs
vs. Deep Research: VMAO provides an open, modular framework with configurable stop conditions [not cited in paper]
vs. ReAct: VMAO coordinates multiple specialized agents rather than a single agent loop

Limitations

Verification evaluates completeness rather than factual accuracy; hallucinated content may pass verification if the answer appears complete
High token cost: 8.5x the tokens of the Single-Agent baseline (850K vs. 100K), due to verification overhead
Evaluation uses a model from the same family (Claude Opus judging Claude Sonnet outputs), potentially introducing bias
Dependency on initial decomposition: if the planner misframes the problem, verification may validate irrelevant answers

📊 Experiments & Results

Evaluation Setup

25 expert-curated market research queries across 4 categories (Performance, Competitive, Financial, Strategic)

Benchmarks:

Market Research Queries (Complex Query Resolution) [New]

Metrics:

Completeness (1-5 scale)
Source Quality (1-5 scale)
Statistical methodology: Not explicitly reported in the paper

Key Results

VMAO outperforms Single-Agent and Static Pipeline baselines across both quality metrics.
Benchmark	Metric	Baseline	This Paper	Δ
Market Research Queries	Completeness (1-5)	3.1	4.2	+1.1
Market Research Queries	Source Quality (1-5)	2.6	4.1	+1.5
Strategic Assessment Queries	Completeness Improvement	Not reported in the paper	Not reported in the paper	+53%

Experiment Figures

Token distribution across orchestration phases

Main Takeaways

Orchestration-level verification significantly improves completeness and source quality by catching gaps that single agents or static pipelines miss
The largest gains (+53%) occur in open-ended 'Strategic Assessment' queries, while gains are smaller for well-defined 'Performance Analysis' queries
Most queries (>75%) terminate via resource-based stop conditions (diminishing returns, token budget), indicating the system effectively trades off cost vs. quality
Replanning primarily triggers 'retries' of existing questions rather than generating new ones, suggesting execution variance (tool failure) is a bigger issue than initial planning
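
The stop conditions mentioned above (diminishing returns, token budget) could be combined in a single check after each verification pass. The sketch below is hypothetical: the function name, the 0-1 completeness score, and every threshold are assumptions made for illustration, not values reported in the paper.

```python
from typing import List, Optional


def stop_reason(
    completeness_history: List[float],  # one verifier score per iteration, assumed in [0, 1]
    tokens_used: int,
    token_budget: int = 1_000_000,      # assumed budget
    target_completeness: float = 0.9,   # assumed quality target
    min_gain: float = 0.05,             # assumed diminishing-returns threshold
    max_iterations: int = 5,            # assumed iteration cap
) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    if not completeness_history:
        return None
    if completeness_history[-1] >= target_completeness:
        return "quality_reached"
    if tokens_used >= token_budget:
        return "token_budget"
    if len(completeness_history) >= 2:
        gain = completeness_history[-1] - completeness_history[-2]
        if gain < min_gain:
            return "diminishing_returns"
    if len(completeness_history) >= max_iterations:
        return "max_iterations"
    return None
```

Checking the quality target first means a run that reaches it is never recorded as a resource-based stop; the ordering of the remaining checks is a design choice and only affects which reason is recorded when several apply.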

📚 Prerequisite Knowledge

Prerequisites

Multi-agent system architectures
Retrieval-Augmented Generation (RAG)
Directed Acyclic Graphs (DAGs)

Key Terms

DAG: Directed Acyclic Graph—a structure used here to organize sub-questions where edges represent dependencies (one question must be answered before another)

MCP: Model Context Protocol—an open standard used in this paper to expose tools (like web search or database access) to agents via standardized HTTP microservices

RAG: Retrieval-Augmented Generation—providing grounded information by retrieving from external sources to answer questions

VMAO: Verified Multi-Agent Orchestration—the proposed framework using a Plan-Execute-Verify-Replan loop

SFT: Supervised Fine-Tuning—training a model on labeled examples (mentioned in baselines/context)

Sonnet 4.5: Claude Sonnet 4.5, the specific LLM used for agent execution in this paper

Opus 4.5: Claude Opus 4.5, the specific LLM used for verification and evaluation in this paper