← Back to Paper List

A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations

Joshua Castillo, Ravi Mukkamala
arXiv (2026)
Agent Factuality KG

📝 Paper Summary

Multi-agent Agentic AI Weak Supervision
Guardian is an end-to-end multi-LLM pipeline that uses consensus mechanisms and role-specialized agents to extract reliable, structured intelligence from unstructured missing-person case narratives.
Core Problem
Early-stage missing-person investigations rely on manual fusion of sparse, heterogeneous, and rapidly evolving data (reports, tips, maps), making it difficult to produce calibrated uncertainty and actionable search plans quickly.
Why it matters:
  • The first 72 hours are critical for recovery, yet traditional planning relies on coarse heuristics and human judgment rather than probabilistic modeling.
  • Single-model LLM approaches are insufficient because individual models are fallible experts; reliance on a single output can lead to hallucinations or malformed data in safety-critical contexts.
  • Downstream analytics like mobility forecasting and hotspot detection require stable, schema-aligned inputs, which raw narrative extractions often fail to provide.
Concrete Example: In a missing-child case, extraction candidates might contain malformed JSON or factually disagree on the 'last seen' location. A single model might hallucinate a location or break the schema, whereas Guardian's consensus engine detects the disagreement, enforces the schema, and merges valid signals.
Key Novelty
Consensus-Driven Multi-LLM Reliability Layer
  • Treats reliability as a pipeline property rather than a model score: extraction is performed by multiple 'fallible expert' models (e.g., Qwen, Llama) running in parallel.
  • Routes all predictions through a centralized consensus engine (Gemini-based) that enforces schema constraints, resolves disagreements via voting/adjudication, and repairs malformed outputs.
  • Uses QLoRA fine-tuned models as interchangeable specialist backends, decoupling the generation of candidates from the adjudication of their validity.
Evaluation Highlights
  • Deployed on a distributed Google Cloud configuration with 3 task-specific VMs (Extractor, Summarizer, Weak-labeler) running 6 concurrent inference servers.
  • Successfully integrated Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models as parallel candidate generators behind a Gemini 2.5 consensus layer.
  • Demonstrates structural reliability (schema conformity) and factual supportability through automated JSON repair and cross-model agreement checks.
Breakthrough Assessment
7/10
Significant practical application of multi-agent consensus for high-stakes, time-sensitive domains. While the underlying LLMs are standard, the architectural emphasis on reliability-through-consensus and system-level auditing is a strong contribution to Applied AI.
×