Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

📝 Paper Summary

Hallucination suppression, Confidence calibration

MACI is a conformal inference framework that uses multiplicative filtering and multi-LLM ensembles to guarantee factuality across different subgroups while retaining significantly more true claims than prior methods.

Core Problem

Existing conformal inference methods for LLMs are either too conservative (discarding many true claims to ensure safety) or fail to provide strict statistical guarantees across diverse subgroups.

Why it matters:

High-stakes domains like medicine and law require strict statistical guarantees of factuality, not just general improvements.
Current methods often rely on a single 'worst-case' score or a global threshold, which ignores the collective confidence of the claims and leads to low information retention.
Standard marginal coverage guarantees allow for dangerous biases where specific subgroups (e.g., certain medical topics) might have high error rates even if the global average is safe.

Concrete Example: In a medical QA scenario, a standard conformal method might discard an entire accurate explanation about 'kidney failure' because one minor detail had a low score, or it might guarantee 90% accuracy globally while being only 60% accurate on 'pediatric' questions. MACI ensures 90% accuracy specifically within the 'pediatric' group while keeping more valid sentences.
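
The gap between a global guarantee and a per-group guarantee in the example above can be checked with simple arithmetic. The sketch below uses hypothetical numbers: the 10% pediatric share and the accuracy assumed for the remaining questions are illustrative assumptions, not figures from the paper.

```python
# Hypothetical mix of question groups: (share of questions, accuracy within group).
# These numbers are illustrative assumptions, not results from the paper.
groups = {
    "pediatric": (0.10, 0.60),
    "other": (0.90, 0.84 / 0.90),  # about 93.3%, chosen so the global average is 0.90
}

# Marginal (global) accuracy is the share-weighted average of the group accuracies.
global_accuracy = sum(share * acc for share, acc in groups.values())

# Group-conditional validity requires every group to meet the target on its own.
target = 0.90
violating = [name for name, (_, acc) in groups.items() if acc < target]

print(f"global accuracy: {global_accuracy:.2f}")
print(f"groups below target: {violating}")
```

Under these assumed numbers the global average meets the 90% target even though the pediatric group sits at 60%, which is the failure mode that group-conditional calibration is meant to rule out.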

Key Novelty

Multi-LLM Adaptive Conformal Inference (MACI)

Reformulates factuality filtering as a cumulative product of claim probabilities rather than a single worst-case threshold, allowing for more nuanced retention decisions.
Integrates a multi-LLM ensemble to estimate factuality scores, theoretically proving that better score estimation directly translates to higher claim retention under the same safety constraints.
Applies group-conditional calibration to ensure validity holds within specific semantic clusters (e.g., topic or entity type) rather than just on average.

Architecture

The complete MACI algorithm flow, integrating the ensemble weight optimization and the group-conditional calibration.

Evaluation Highlights

Achieves higher retention ratio (keeping more true claims) than baseline BCI and localized conformal prediction across multiple datasets (e.g., Biography, medical QA).
Strictly maintains user-specified error rates (e.g., alpha=0.1) within specific subgroups, whereas baselines often violate coverage requirements in difficult groups.
Ensemble approach reduces the gap between estimated and oracle factuality scores, empirically validating the theoretical link between estimation error and retention efficiency.

Breakthrough Assessment

8/10

Strong theoretical contribution linking estimation error to retention in conformal inference, combined with a practical algorithm that solves the 'conservativeness' problem of previous methods. Directly addresses the trade-off between safety and utility.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc filtering of LLM-generated claims to ensure statistical factuality guarantees.

Inputs: A document D consisting of a prompt P and a set of generated claims C = {c1, ..., cN}.

Outputs: A subset of claims F(C) such that the probability of all retained claims being true is at least 1 - alpha.
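
Written out, the target guarantee might look as follows. The symbols F(C), D, and alpha come from the definitions above; the predicate true(c), the group index k, and the use of the grouping function g as g(D) = k are notation assumed here for illustration. The first line is the marginal statement and the second is the group-conditional version.

```latex
\begin{align}
\Pr\bigl[\,\forall c \in F(C):\ \mathrm{true}(c)\,\bigr] &\ge 1 - \alpha
  && \text{(marginal)} \\
\Pr\bigl[\,\forall c \in F(C):\ \mathrm{true}(c) \,\bigm|\, g(D) = k\,\bigr] &\ge 1 - \alpha
  \quad \text{for every group } k
  && \text{(group-conditional)}
\end{align}
```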

Pipeline Flow

Claim Decomposition: Split LLM response into atomic claims
Scoring Ensemble: Multiple LLMs score each claim's factuality
Conformity Scoring: Calculate cumulative product scores for calibration
Group Calibration: Determine thresholds per group (e.g., topic)
Filtering: Remove claims below the calibrated threshold
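
The Group Calibration step can be sketched with the standard split-conformal quantile applied separately within each group. This is a minimal sketch under assumptions: the function names, the example groups, and the per-document scores are hypothetical, and whether MACI uses exactly this quantile rule is not stated in this summary.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify level alpha for this group.
        return math.inf
    return sorted(scores)[rank - 1]


def group_thresholds(cal_scores, cal_groups, alpha):
    """Compute one threshold per group from per-document calibration scores."""
    by_group = defaultdict(list)
    for score, group in zip(cal_scores, cal_groups):
        by_group[group].append(score)
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


# Hypothetical calibration data: 9 documents in one group, 4 in another.
cal_scores = [0.12, 0.35, 0.41, 0.50, 0.58, 0.63, 0.70, 0.77, 0.91,
              0.20, 0.45, 0.60, 0.80]
cal_groups = ["cardiology"] * 9 + ["pediatric"] * 4

thresholds = group_thresholds(cal_scores, cal_groups, alpha=0.1)
print(thresholds)
```

In this toy run the smaller group receives an infinite threshold because four calibration points cannot support a 0.1 error level under this rule. A filter that keeps only claims at or above the threshold would then retain nothing for that group, which illustrates why per-group calibration needs enough labeled data in every group.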

System Modules

Claim Decomposer

Breaks down a long-form generation into individual atomic claims for granular scoring.

Model or implementation: Not explicitly specified (likely heuristic or LLM-based)

Factuality Scorer Ensemble

Estimates the probability that each claim is true using a weighted ensemble of multiple LLMs.

Model or implementation: Ensemble of varying LLMs (weights optimized via linear programming)
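
One simple way to combine per-model scores is a normalized weighted average. The sketch below assumes that form; the summary only states that the ensemble is weighted, so the exact combination rule, the function name, and the example numbers are assumptions.

```python
def ensemble_score(model_probs, weights):
    """Weighted average of per-model factuality probabilities for a single claim."""
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Normalizing by the total keeps the result in [0, 1] for non-negative weights.
    return sum(w * p for w, p in zip(weights, model_probs)) / total


# Hypothetical scores from three models for one claim, with hypothetical weights.
score = ensemble_score([0.90, 0.60, 0.75], [0.50, 0.25, 0.25])
print(round(score, 4))
```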

Multiplicative Filter

Calculates conformity scores based on the cumulative product of claim probabilities and applies the group-specific threshold.

Model or implementation: Mathematical operator (Cumulative Product)
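
A cumulative-product filter can be sketched as follows. This is one plausible reading, not the paper's confirmed procedure: claims are ranked by estimated probability, and the longest prefix whose running product stays at or above the threshold is retained. The ranking step, the function name, and the example values are assumptions.

```python
def multiplicative_filter(claims, probs, threshold):
    """Keep the longest top-ranked prefix whose cumulative product is >= threshold."""
    # Rank claims from most to least confident.
    ranked = sorted(zip(claims, probs), key=lambda cp: cp[1], reverse=True)
    kept, running = [], 1.0
    for claim, p in ranked:
        # Probabilities are at most 1, so the product never increases;
        # the first violation therefore marks the end of the valid prefix.
        if running * p < threshold:
            break
        running *= p
        kept.append(claim)
    return kept, running


# Hypothetical claims and scores; the threshold would come from calibration.
claims = ["a", "b", "c", "d"]
probs = [0.99, 0.95, 0.97, 0.40]
kept, product = multiplicative_filter(claims, probs, threshold=0.9)
print(kept, round(product, 6))
```

If the claim probabilities were independent and well calibrated, the running product would approximate the probability that every retained claim is true, which is the quantity the output guarantee is stated in terms of.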

Novel Architectural Elements

Multiplicative conformity score formulation: treats claim retention as a cumulative product problem to improve the theoretical bounds on retention.
Ensemble-based conformal calibration: explicitly integrates multi-model outputs into the conformal scoring pipeline to reduce estimation error.

Modeling

Base Model: Ensemble of LLMs (the framework is model-agnostic; the specific models used in the experiments are not detailed in the summarized text)

Training Method: Linear Programming for Ensemble Weights

Objective Functions:

Purpose: Optimize ensemble weights to minimize False Positive Rate while maintaining True Positive Rate.

Formally: minimize sum(w_m * FPR_m) subject to sum(w_m * TPR_m) >= 1 - delta.
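
The weight optimization can be illustrated with a small exact solver. The sketch assumes the weights are non-negative and sum to 1, which the formulation above does not state. Under that assumption a linear program with one extra inequality has an optimal solution that uses at most two models, so checking every single model and every pair is sufficient. The function name and the per-model rates are hypothetical.

```python
from itertools import combinations


def solve_weights(fpr, tpr, delta):
    """Minimize sum(w * fpr) s.t. sum(w * tpr) >= 1 - delta, w >= 0, sum(w) = 1."""
    target = 1.0 - delta
    m = len(fpr)
    best_w, best_cost = None, float("inf")

    # Candidates that put all weight on a single feasible model.
    for i in range(m):
        if tpr[i] >= target and fpr[i] < best_cost:
            best_cost = fpr[i]
            best_w = [1.0 if j == i else 0.0 for j in range(m)]

    # Candidates that mix two models so the TPR constraint holds with equality.
    for i, j in combinations(range(m), 2):
        if tpr[i] == tpr[j]:
            continue
        lam = (target - tpr[j]) / (tpr[i] - tpr[j])  # weight on model i
        if not 0.0 < lam < 1.0:
            continue
        cost = lam * fpr[i] + (1.0 - lam) * fpr[j]
        if cost < best_cost:
            best_cost = cost
            best_w = [0.0] * m
            best_w[i], best_w[j] = lam, 1.0 - lam

    return best_w, best_cost  # best_w is None if the constraint is infeasible


# Hypothetical empirical rates for three scorers, measured on a calibration set.
fpr = [0.30, 0.10, 0.05]
tpr = [0.98, 0.90, 0.80]
weights, cost = solve_weights(fpr, tpr, delta=0.05)
print(weights, cost)
```

With these assumed rates the best solution mixes the first two scorers: neither of the low-FPR models reaches the TPR target alone, and blending in the high-TPR model is cheaper than using it exclusively. For larger problems or additional constraints, a general-purpose LP solver would be the more usual choice.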

Adaptation: None (Optimization is on weights w, not model parameters)

Trainable Parameters: Ensemble weights vector w (M dimensions)

Training Data:

Calibration set used to compute empirical FPR/TPR for weight optimization

Key Hyperparameters:

delta: User-specified tolerance for TPR constraint
alpha: User-specified error rate (e.g., 0.1)

Compute: Inference cost scales linearly with number of models in ensemble (M).

Comparison to Prior Work

vs. BCI: MACI provides group-conditional validity instead of just marginal validity and uses multiplicative scoring for higher retention.
vs. Cherian et al. (2024): MACI guarantees fixed error rates (crucial for high-stakes settings) rather than adaptive ones and handles complex groupings explicitly.
vs. LCP [not cited in paper]: Standard LCP is computationally expensive and data-hungry; MACI uses discrete grouping for efficiency.
vs. Self-Consistency [not cited in paper]: Sampling-based methods are slow and lack statistical guarantees; MACI is distribution-free and guaranteed.

Limitations

Requires an initial calibration set with ground truth labels.
Computational cost increases with the number of LLMs in the ensemble.
Grouping function g must be defined a priori (e.g., by domain or topic).
Performance depends on the quality of the base factuality estimators.

Reproducibility

Code: https://github.com/MLAI-Yonsei/MACI.git

The paper describes the algorithm and theoretical proofs in detail in its appendices. The specific LLM checkpoints used for the ensemble in the experiments are not listed in the summarized text.

📊 Experiments & Results

Evaluation Setup

Factuality filtering on LLM-generated texts across different domains.

Benchmarks:

Biography (Open-ended generation)
Medical QA (generic) (Domain-specific QA)

Metrics:

Retention Ratio (proportion of claims kept)
Empirical Coverage (actual valid rate vs target)
Set Size
Statistical methodology: Not explicitly reported in the paper
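
The first two metrics can be computed as in the sketch below. It follows the definitions given in this summary (proportion of claims kept, and the rate at which retained sets are fully factual); the paper's exact aggregation, for example pooled versus per-document averaging, is not specified here, so the pooled form and the example data are assumptions.

```python
def retention_ratio(n_original, n_retained):
    """Pooled proportion of claims kept across all documents."""
    return sum(n_retained) / sum(n_original)


def empirical_coverage(retained_labels):
    """Fraction of documents whose retained claims are all true.

    An empty retained set counts as covered, since all([]) is True.
    """
    return sum(all(labels) for labels in retained_labels) / len(retained_labels)


# Hypothetical evaluation data for three documents.
n_original = [4, 5, 3]
retained_labels = [
    [True, True, True],   # 3 claims kept, all factual
    [True, False],        # 2 claims kept, one hallucination slipped through
    [True, True, True],   # 3 claims kept, all factual
]
n_retained = [len(labels) for labels in retained_labels]

retention = retention_ratio(n_original, n_retained)
coverage = empirical_coverage(retained_labels)
print(f"retention={retention:.3f}, coverage={coverage:.3f}")
```

Coverage is then compared against the target 1 - alpha, overall and within each group.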

Key Results

Visual evidence from Figure 1 suggests MACI significantly outperforms baselines in keeping relevant information while removing hallucinations.

Benchmark	Metric	Baseline	This Paper	Δ
Example Response	Retention quality	Low retention (many true claims cut)	High retention (retains true claims)	Qualitative improvement

Experiment Figures

A comparison of filtered outputs between Basic Conformal Inference and MACI on a sample text.

Main Takeaways

MACI consistently achieves the user-specified coverage (validity) across diverse datasets, unlike baselines which may violate it in hard subgroups.
The multiplicative filtering approach yields substantially higher retention ratios than single-score thresholding methods.
Using an ensemble of LLMs for scoring improves the quality of the conformity score, which theoretically and empirically increases retention efficiency.
Group-conditional calibration effectively handles heterogeneity in data (e.g., varying difficulty across medical topics).

📚 Prerequisite Knowledge

Prerequisites

Conformal Prediction / Conformal Inference
Probability theory (conditional probability, quantiles)
Large Language Models (basic generation and scoring)

Key Terms

conformal inference: A statistical framework that constructs prediction sets (or filters) with a guaranteed probability of containing the true label (or being factual) regardless of the underlying distribution.

factuality-score: A probability score assigned to an individual claim indicating the likelihood that the claim is factual.

retention ratio: The proportion of original claims that are kept by the filtering mechanism; a measure of the filter's utility or efficiency.

validity: The property that the filtering mechanism actually respects the target error rate (e.g., ensuring 90% of retained sets are fully factual).

group-conditional coverage: Ensuring that the validity guarantee holds within specific subpopulations or groups, not just on average across the whole dataset.

conformity score: A scalar value measuring how well (or, in the nonconformity convention, how poorly) a data point agrees with the calibration data; used to calibrate the threshold.

exchangeability: A statistical assumption that the order of data points does not affect their joint distribution; weaker than i.i.d. but sufficient for conformal guarantees.

marginal coverage: Validity guaranteed on average over the entire data distribution.

BCI: Basic Conformal Inference, a baseline method that applies a single global threshold to filter claims.