
Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song
Department of Applied Statistics and Data Science, Yonsei University; Department of Mathematical Sciences, KAIST
arXiv (2026)
Factuality Benchmark

📝 Paper Summary

Hallucination suppression, Confidence calibration
MACI is a conformal inference framework that uses multiplicative filtering and multi-LLM ensembles to guarantee factuality across different subgroups while retaining significantly more true claims than prior methods.
Core Problem
Existing conformal inference methods for LLMs are either too conservative (discarding many true claims to ensure safety) or fail to provide strict statistical guarantees across diverse subgroups.
Why it matters:
  • High-stakes domains like medicine and law require strict statistical guarantees of factuality, not just general improvements.
  • Current methods often rely on a single 'worst-case' score or global thresholds, which ignores the collective confidence of claims and leads to low information retention.
  • Standard marginal coverage guarantees allow for dangerous biases where specific subgroups (e.g., certain medical topics) might have high error rates even if the global average is safe.
Concrete Example: In a medical QA scenario, a standard conformal method might discard an entire accurate explanation about 'kidney failure' because one minor detail had a low score, or it might guarantee 90% accuracy globally while being only 60% accurate on 'pediatric' questions. MACI ensures 90% accuracy specifically within the 'pediatric' group while keeping more valid sentences.
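The gap between marginal and group-conditional accuracy in the example above can be checked with simple arithmetic. The sketch below uses made-up counts (900 'general' and 100 'pediatric' questions), chosen only so that the numbers match the 90% and 60% figures in the example; they are not results from the paper.

```python
from typing import Dict, Tuple

# Hypothetical counts: group -> (number of correct answers, number of questions).
# These values are illustrative assumptions, not data from the paper.
counts: Dict[str, Tuple[int, int]] = {
    "general": (840, 900),
    "pediatric": (60, 100),
}


def marginal_accuracy(counts: Dict[str, Tuple[int, int]]) -> float:
    """Accuracy pooled over all groups (what a marginal guarantee controls)."""
    total_correct = sum(correct for correct, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return total_correct / total


def accuracy_by_group(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Accuracy inside each group (what a group-conditional guarantee controls)."""
    return {group: correct / n for group, (correct, n) in counts.items()}


if __name__ == "__main__":
    print("marginal:", marginal_accuracy(counts))
    for group, acc in accuracy_by_group(counts).items():
        print(group, round(acc, 3))
```

The pooled accuracy is 900/1000 = 90%, yet the pediatric group sits at 60% because it is small and the larger group compensates for it. A group-conditional guarantee targets each row of the per-group table separately.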
Key Novelty
Multi-LLM Adaptive Conformal Inference (MACI)
  • Reformulates factuality filtering as a cumulative product of claim probabilities rather than a single worst-case threshold, allowing for more nuanced retention decisions.
  • Integrates a multi-LLM ensemble to estimate factuality scores, theoretically proving that better score estimation directly translates to higher claim retention under the same safety constraints.
  • Applies group-conditional calibration to ensure validity holds within specific semantic clusters (e.g., topic or entity type) rather than just on average.
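One possible reading of the cumulative-product idea is sketched below. It is an illustration under stated assumptions, not the paper's algorithm: the function name `filter_by_cumulative_product`, the descending sort, and the threshold `tau` are all hypothetical, and in MACI the threshold would come from conformal calibration rather than being fixed by hand. Treating the product of scores as the chance that every retained claim is true also assumes the claims are independent.

```python
from typing import List


def filter_by_cumulative_product(scores: List[float], tau: float) -> List[int]:
    """Return indices of retained claims (in original order).

    Claims are visited from most to least confident. A claim is kept only
    while the running product of kept scores stays at or above `tau`.
    Illustrative sketch only; `tau` is assumed to be given.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    product = 1.0
    for i in order:
        if product * scores[i] < tau:
            break  # adding this claim would push joint confidence below tau
        product *= scores[i]
        kept.append(i)
    return sorted(kept)


if __name__ == "__main__":
    claim_scores = [0.97, 0.60, 0.99, 0.98]  # hypothetical factuality scores
    print(filter_by_cumulative_product(claim_scores, tau=0.90))
```

With these toy scores and `tau=0.90`, the three confident claims are kept (their product is about 0.94) and only the 0.60 claim is dropped, so one weak claim does not force the whole response to be discarded.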
Architecture
[Architecture figure: Algorithm 2]
The complete MACI algorithm flow, integrating the ensemble weight optimization and the group-conditional calibration.
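The group-conditional calibration step can be pictured with the standard split-conformal quantile computed separately per group. The sketch below is a generic textbook construction, not a reproduction of Algorithm 2; the names `conformal_quantile` and `group_thresholds` and the toy scores are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the calibration set is too small to support
    the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def group_thresholds(
    scores: Sequence[float], groups: Sequence[Hashable], alpha: float
) -> Dict[Hashable, float]:
    """Calibrate one threshold per group instead of a single global one."""
    by_group: Dict[Hashable, List[float]] = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: conformal_quantile(v, alpha) for g, v in by_group.items()}


if __name__ == "__main__":
    # Hypothetical nonconformity scores for two groups of different size.
    cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.2, 0.4, 0.6, 0.8]
    cal_groups = ["general"] * 9 + ["pediatric"] * 4
    print(group_thresholds(cal_scores, cal_groups, alpha=0.1))
```

In this toy run the small group receives an infinite threshold at alpha=0.1, which shows the usual cost of per-group calibration: each group needs enough calibration examples of its own.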
Evaluation Highlights
  • Achieves a higher retention ratio (keeps more true claims) than the BCI baseline and localized conformal prediction across multiple datasets (e.g., Biography, medical QA).
  • Strictly maintains user-specified error rates (e.g., alpha=0.1) within specific subgroups, whereas baselines often violate coverage requirements in difficult groups.
  • Ensemble approach reduces the gap between estimated and oracle factuality scores, empirically validating the theoretical link between estimation error and retention efficiency.
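The intuition behind the ensemble result can be seen in a toy calculation: averaging several imperfect score estimates can never have a larger mean absolute error than the average error of the individual models, and it is often smaller when their errors partly cancel. The sketch below uses uniform weights and invented numbers. The summary does not describe MACI's weight optimization, so nothing here should be read as the paper's procedure or results.

```python
from typing import List, Sequence


def mean_abs_error(estimates: Sequence[float], oracle: Sequence[float]) -> float:
    """Mean absolute gap between estimated and oracle factuality scores."""
    return sum(abs(e - o) for e, o in zip(estimates, oracle)) / len(oracle)


def uniform_ensemble(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-claim scores across models with equal weights (an assumption)."""
    n_models = len(model_scores)
    return [sum(column) / n_models for column in zip(*model_scores)]


# Hypothetical oracle scores for four claims and three models' estimates.
oracle = [0.9, 0.2, 0.7, 0.5]
model_scores = [
    [0.8, 0.3, 0.9, 0.4],
    [1.0, 0.1, 0.6, 0.7],
    [0.9, 0.4, 0.5, 0.5],
]

single_maes = [mean_abs_error(scores, oracle) for scores in model_scores]
ensemble_mae = mean_abs_error(uniform_ensemble(model_scores), oracle)

if __name__ == "__main__":
    print("single-model MAE:", [round(m, 3) for m in single_maes])
    print("ensemble MAE:", round(ensemble_mae, 3))
```

In this toy data the ensemble's error is below that of every single model, although in general only the comparison with the average single-model error is guaranteed.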
Breakthrough Assessment
8/10
Strong theoretical contribution linking estimation error to retention in conformal inference, combined with a practical algorithm that solves the 'conservativeness' problem of previous methods. Directly addresses the trade-off between safety and utility.