Evaluation Setup
The evaluation is a qualitative and diagnostic analysis of pipeline behavior under realistic operating conditions, using synthetic and semi-structured missing-child case narratives.
Metrics:
- Structural correctness (parseable, schema-aligned)
- Factual correctness (relative to synthetic ground truth)
- Cross-model consistency
- Reliability (valid outputs under disagreement)
Statistical methodology: Not explicitly reported in the paper.
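The paper does not give formulas for these metrics, so the following is only a minimal sketch of how three of them might be computed, assuming each model emits a flat JSON object. The schema in `REQUIRED_FIELDS` and all function names are hypothetical and are not taken from the paper.

```python
import json
from itertools import combinations

# Hypothetical schema: these field names are illustrative, not from the paper.
REQUIRED_FIELDS = {"name", "age", "last_seen_location"}


def is_structurally_valid(raw: str) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_FIELDS.issubset(obj)


def field_accuracy(pred: dict, truth: dict) -> float:
    """Fraction of ground-truth fields that the prediction matches exactly."""
    if not truth:
        return 0.0
    hits = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return hits / len(truth)


def pairwise_agreement(outputs: list[dict]) -> float:
    """Mean fraction of required fields on which each pair of models agrees."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output trivially agrees with itself
    total = 0.0
    for a, b in pairs:
        same = sum(1 for key in REQUIRED_FIELDS if a.get(key) == b.get(key))
        total += same / len(REQUIRED_FIELDS)
    return total / len(pairs)
```

Exact-match comparison is the simplest possible choice here; free-text fields in real case narratives would likely need normalization or fuzzy matching before comparison.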
Main Takeaways
- Reliability is framed as a system property: the pipeline produces valid outputs even when individual models disagree or fail, thanks to the consensus layer.
- The system successfully integrates heterogeneous models (Qwen, Llama, Gemini) to balance local specialization with cloud-based reasoning.
- Strict prompting constraints (Format-Guard prompts) combined with deterministic repair mechanisms effectively handle the instability of LLM outputs.
- The architecture prioritizes conservative, auditable outputs over raw predictive power, aligning with ethical requirements in safety-critical domains.
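To make the first and third takeaways concrete, here is a minimal sketch of a deterministic repair step followed by a conservative consensus step. It is an illustration under stated assumptions, not the paper's implementation: the repair rules, the strict-majority vote, the abstain-with-`None` policy, and the names `repair_json` and `consensus` are all hypothetical.

```python
import json
import re
from collections import Counter


def repair_json(raw: str):
    """Deterministically recover a JSON object from noisy model text, else None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]  # drops code fences and surrounding chatter
    # Remove trailing commas before a closing brace or bracket.
    # Caveat: this simple rule would also alter such a sequence inside a string.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def consensus(raw_outputs: dict[str, str], fields: list[str]):
    """Per-field vote; a value is kept only if a strict majority of ALL models agree."""
    parsed = {model: repair_json(raw) for model, raw in raw_outputs.items()}
    valid = {model: obj for model, obj in parsed.items() if obj is not None}
    audit = {"failed_models": sorted(set(parsed) - set(valid)), "unresolved": []}
    result = {}
    for field in fields:
        # Serialize values so that lists and dicts can be counted as votes.
        votes = Counter(json.dumps(obj.get(field), sort_keys=True) for obj in valid.values())
        if votes:
            value, count = votes.most_common(1)[0]
            if count * 2 > len(raw_outputs):
                result[field] = json.loads(value)
                continue
        result[field] = None  # conservative: abstain rather than guess
        audit["unresolved"].append(field)
    return result, audit
```

With three models where one fails to produce any JSON and the other two disagree on a field, this sketch still returns a schema-complete record, leaves the disputed field empty, and records both the failure and the disagreement in `audit`. That is one way a pipeline could stay valid and auditable under disagreement.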