A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations

📝 Paper Summary

Multi-agent Agentic AI Weak Supervision

Guardian is an end-to-end multi-LLM pipeline that uses consensus mechanisms and role-specialized agents to extract reliable, structured intelligence from unstructured missing-person case narratives.

Core Problem

Early-stage missing-person investigations rely on manual fusion of sparse, heterogeneous, and rapidly evolving data (reports, tips, maps), making it difficult to produce calibrated uncertainty and actionable search plans quickly.

Why it matters:

The first 72 hours are critical for recovery, yet traditional planning relies on coarse heuristics and human judgment rather than probabilistic modeling.
Single-model LLM approaches are insufficient because individual models are fallible experts; reliance on a single output can lead to hallucinations or malformed data in safety-critical contexts.
Downstream analytics like mobility forecasting and hotspot detection require stable, schema-aligned inputs, which raw narrative extractions often fail to provide.

Concrete Example: In a missing-child case, extraction candidates might contain malformed JSON or factually disagree on the 'last seen' location. A single model might hallucinate a location or break the schema, whereas Guardian's consensus engine detects the disagreement, enforces the schema, and merges valid signals.

Key Novelty

Consensus-Driven Multi-LLM Reliability Layer

Treats reliability as a pipeline property rather than a model score: extraction is performed by multiple 'fallible expert' models (e.g., Qwen, Llama) running in parallel.
Routes all predictions through a centralized consensus engine (Gemini-based) that enforces schema constraints, resolves disagreements via voting/adjudication, and repairs malformed outputs.
Uses QLoRA fine-tuned models as interchangeable specialist backends, decoupling the generation of candidates from the adjudication of their validity.

Architecture

The Guardian LLM Pipeline architecture, illustrating the flow from case narrative through orchestration to model execution.

Evaluation Highlights

Deployed on a distributed Google Cloud configuration with 3 task-specific VMs (Extractor, Summarizer, Weak-labeler) running 6 concurrent inference servers.
Successfully integrated Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models as parallel candidate generators behind a Gemini 2.5 consensus layer.
Demonstrates structural reliability (schema conformity) and factual supportability through automated JSON repair and cross-model agreement checks.

Breakthrough Assessment

7/10

Significant practical application of multi-agent consensus for high-stakes, time-sensitive domains. While the underlying LLMs are standard, the architectural emphasis on reliability-through-consensus and system-level auditing is a strong contribution to Applied AI.

⚙️ Technical Details

Problem Definition

Setting: End-to-end information extraction and decision support for missing-person search planning

Inputs: Heterogeneous raw inputs (narrative reports, PDFs, public tips, transit data, maps)

Outputs: Probabilistic search surfaces, ranked sectors, hotspots, and containment rings for 24-, 48-, and 72-hour horizons

Pipeline Flow

Input Processing: Case Narrative (Loading/Parsing) → Orchestrator
Generation: Orchestrator → Specialist Models (Summarizer, Extractor, Weak Labeler)
Reliability: Candidate Outputs → Consensus Engine (Normalization, Agreement, Repair) → Validated Output
Downstream: Validated Output → Zone QA → Search Plan Generation

System Modules

Case Narrative / Parser Pack

Ingest raw inputs, normalize extracted fields, and enrich cases with external contextual data

Model or implementation: Not specified (Rule-based/heuristic)

Specialist LLMs (Summarizer, Extractor, Weak Labeler)

Generate primary artifacts: summaries, schema-aligned entity extractions, and weak labels for movement/risk

Model or implementation: Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct (running concurrently)

Consensus Engine

Enforce schema conformity, compute agreement, and resolve disagreements via referee prompts

Model or implementation: Gemini 2.5 Flash/Pro

Zone QA

Apply plausibility scoring to candidate zones and reweight priorities

Model or implementation: LLM-assisted scoring (Model details implied as part of Core)

Novel Architectural Elements

Centralized multi-model consensus layer that treats reliability as an operational pipeline property (normalization → agreement → repair → referee)
Dual-mode execution strategy: Stage-by-stage (throughput) vs. Case-by-case (debugging)
Integration of fine-tuned local models (Qwen/Llama) as 'fallible experts' whose outputs are adjudicated by a stronger cloud model (Gemini)

Modeling

Base Model: Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct (Local Specialists); Gemini 2.5 Flash/Pro (Consensus)

Training Method: QLoRA (Quantized Low-Rank Adaptation)

Adaptation: Low-rank adapter parameters on top of quantized base model

Training Data:

Curated datasets reflecting Guardian’s operational tasks

Compute: 3 distributed Google Cloud GPU VMs; Dockerized vLLM servers; Qwen on port 8001, Llama on port 8002

Comparison to Prior Work

vs. Standard Entity Extraction: Guardian adds a consensus layer to mediate between multiple extraction models rather than trusting one.
vs. Standard Weak Supervision: Guardian uses LLMs as 'controlled labelers' with strict schema enforcement and auditing, rather than just for generating training data.
vs. Traditional Search Planning: Replaces manual fusion of heuristics with a probabilistic, automated pipeline that generates calibrated uncertainty surfaces.

Limitations

Evaluation is qualitative and diagnostic rather than benchmark-based (accuracy metrics not reported).
Relies on synthetic data for evaluation to protect sensitive real-world case information.
Consensus logic depends on the superior capability of the referee model (Gemini), which could introduce a single point of failure or bias.

Reproducibility

No replication artifacts mentioned in the paper (code, data, or weights not provided).

📊 Experiments & Results

Evaluation Setup

Qualitative and diagnostic analysis of pipeline behavior under realistic operating conditions using synthetic/semi-structured missing-child case narratives.

Metrics:

Structural correctness (parseable, schema-aligned)
Factual correctness (relative to synthetic ground truth)
Cross-model consistency
Reliability (valid outputs under disagreement)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Detail of the Consensus Engine workflow.

Representative JSON output excerpt produced by the Llama LLM.

Representative JSON output excerpt produced by the Gemini-based consensus LLM.

Main Takeaways

Reliability is framed as a system property: the pipeline produces valid outputs even when individual models disagree or fail, thanks to the consensus layer.
The system successfully integrates heterogeneous models (Qwen, Llama, Gemini) to balance local specialization with cloud-based reasoning.
Strict prompting constraints (Format-Guard prompts) combined with deterministic repair mechanisms effectively handle the instability of LLM outputs.
The architecture prioritizes conservative, auditable outputs over raw predictive power, aligning with ethical requirements in safety-critical domains.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Weak Supervision
Retrieval-Augmented Generation / Information Extraction pipelines
Geospatial analysis concepts (hotspots, containment rings)

Key Terms

Consensus Engine: A centralized module that compares outputs from multiple models to resolve disagreements and enforce schema validity

QLoRA: Quantized Low-Rank Adaptation—a parameter-efficient fine-tuning method that updates only a small set of parameters on top of a quantized base model

Weak Labeling: Using noisy or probabilistic signals (from LLMs) to generate training data or structured tags when ground truth is scarce

Zone QA: Quality Assurance module that applies LLM-assisted plausibility scoring to candidate search zones to reweight priorities

RL: Reinforcement Learning—referenced here as a source of auxiliary zone scores used in the Zone QA module

JSON Repair: A deterministic or model-based process to fix malformed JSON outputs so they conform to a required schema

Orchestrator: A system module that coordinates parallel execution, manages deadlines, and handles caching across different model tasks