Guardian is an end-to-end multi-LLM pipeline that uses consensus mechanisms and role-specialized agents to extract reliable, structured intelligence from unstructured missing-person case narratives.
Core Problem
Early-stage missing-person investigations rely on manual fusion of sparse, heterogeneous, and rapidly evolving data (reports, tips, maps), making it difficult to produce calibrated uncertainty and actionable search plans quickly.
Why it matters:
The first 72 hours are critical for recovery, yet traditional planning relies on coarse heuristics and human judgment rather than probabilistic modeling.
Single-model LLM approaches are insufficient because individual models are fallible experts; reliance on a single output can lead to hallucinations or malformed data in safety-critical contexts.
Downstream analytics like mobility forecasting and hotspot detection require stable, schema-aligned inputs, which raw narrative extractions often fail to provide.
Concrete Example:In a missing-child case, extraction candidates might contain malformed JSON or factually disagree on the 'last seen' location. A single model might hallucinate a location or break the schema, whereas Guardian's consensus engine detects the disagreement, enforces the schema, and merges valid signals.
Key Novelty
Consensus-Driven Multi-LLM Reliability Layer
Treats reliability as a pipeline property rather than a model score: extraction is performed by multiple 'fallible expert' models (e.g., Qwen, Llama) running in parallel.
Routes all predictions through a centralized consensus engine (Gemini-based) that enforces schema constraints, resolves disagreements via voting/adjudication, and repairs malformed outputs.
Uses QLoRA fine-tuned models as interchangeable specialist backends, decoupling the generation of candidates from the adjudication of their validity.
Architecture
The Guardian LLM Pipeline architecture, illustrating the flow from case narrative through orchestration to model execution.
Evaluation Highlights
Deployed on a distributed Google Cloud configuration with 3 task-specific VMs (Extractor, Summarizer, Weak-labeler) running 6 concurrent inference servers.
Successfully integrated Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models as parallel candidate generators behind a Gemini 2.5 consensus layer.
Demonstrates structural reliability (schema conformity) and factual supportability through automated JSON repair and cross-model agreement checks.
Breakthrough Assessment
7/10
Significant practical application of multi-agent consensus for high-stakes, time-sensitive domains. While the underlying LLMs are standard, the architectural emphasis on reliability-through-consensus and system-level auditing is a strong contribution to Applied AI.
⚙️ Technical Details
Problem Definition
Setting: End-to-end information extraction and decision support for missing-person search planning
Inputs: Heterogeneous raw inputs (narrative reports, PDFs, public tips, transit data, maps)
Outputs: Probabilistic search surfaces, ranked sectors, hotspots, and containment rings for 24-, 48-, and 72-hour horizons
Pipeline Flow
Input Processing: Case Narrative (Loading/Parsing) → Orchestrator
Compute: 3 distributed Google Cloud GPU VMs; Dockerized vLLM servers; Qwen on port 8001, Llama on port 8002
Comparison to Prior Work
vs. Standard Entity Extraction: Guardian adds a consensus layer to mediate between multiple extraction models rather than trusting one.
vs. Standard Weak Supervision: Guardian uses LLMs as 'controlled labelers' with strict schema enforcement and auditing, rather than just for generating training data.
vs. Traditional Search Planning: Replaces manual fusion of heuristics with a probabilistic, automated pipeline that generates calibrated uncertainty surfaces.
Limitations
Evaluation is qualitative and diagnostic rather than benchmark-based (accuracy metrics not reported).
Relies on synthetic data for evaluation to protect sensitive real-world case information.
Consensus logic depends on the superior capability of the referee model (Gemini), which could introduce a single point of failure or bias.
Reproducibility
No replication artifacts mentioned in the paper (code, data, or weights not provided).
📊 Experiments & Results
Evaluation Setup
Qualitative and diagnostic analysis of pipeline behavior under realistic operating conditions using synthetic/semi-structured missing-child case narratives.
Factual correctness (relative to synthetic ground truth)
Cross-model consistency
Reliability (valid outputs under disagreement)
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Detail of the Consensus Engine workflow.
Representative JSON output excerpt produced by the Llama LLM.
Representative JSON output excerpt produced by the Gemini-based consensus LLM.
Main Takeaways
Reliability is framed as a system property: the pipeline produces valid outputs even when individual models disagree or fail, thanks to the consensus layer.
The system successfully integrates heterogeneous models (Qwen, Llama, Gemini) to balance local specialization with cloud-based reasoning.
Strict prompting constraints (Format-Guard prompts) combined with deterministic repair mechanisms effectively handle the instability of LLM outputs.
The architecture prioritizes conservative, auditable outputs over raw predictive power, aligning with ethical requirements in safety-critical domains.
📚 Prerequisite Knowledge
Prerequisites
Large Language Models (LLMs)
Weak Supervision
Retrieval-Augmented Generation / Information Extraction pipelines
Consensus Engine: A centralized module that compares outputs from multiple models to resolve disagreements and enforce schema validity
QLoRA: Quantized Low-Rank Adaptation—a parameter-efficient fine-tuning method that updates only a small set of parameters on top of a quantized base model
Weak Labeling: Using noisy or probabilistic signals (from LLMs) to generate training data or structured tags when ground truth is scarce
Zone QA: Quality Assurance module that applies LLM-assisted plausibility scoring to candidate search zones to reweight priorities
RL: Reinforcement Learning—referenced here as a source of auxiliary zone scores used in the Zone QA module
JSON Repair: A deterministic or model-based process to fix malformed JSON outputs so they conform to a required schema
Orchestrator: A system module that coordinates parallel execution, manages deadlines, and handles caching across different model tasks