Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors

📝 Paper Summary

Hallucination suppression, Factual consistency evaluation

DEEP ensembles binary outputs from diverse LLM prompts to detect factual errors in summaries, using calibration to provide reliable probability estimates without fine-tuning the underlying model.

Core Problem

Existing factual consistency models (like fine-tuned RoBERTa) rely on thresholding techniques that require access to labeled target data, which is unrealistic in practice. Furthermore, individual LLM prompts are often overconfident and fail to capture nuances.

Why it matters:

Optimizing thresholds on test data artificially inflates performance; real-world usage requires models that work on unseen data without tuning
Current SOTA encoder models perform significantly worse when thresholds are not optimized on the specific dataset being tested
LLMs produce convincing but false information (hallucinations), making automated, reliable error detection critical for high-stakes summarization tasks

Concrete Example: When evaluating the TofuEval dataset, a standard factual consistency model might need a threshold of 0.8 to work well, but on AggreFact, it might need 0.4. Without knowing the 'correct' threshold beforehand (which requires labeled data), the model's accuracy drops significantly.
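The threshold problem above can be illustrated with a minimal sketch. The scores and labels below are synthetic and invented for illustration (they are not taken from TofuEval or AggreFact), and the helper names are hypothetical. The sketch tunes a threshold on one dataset and reuses it on another, which is what deployment without labeled target data amounts to.

```python
def balanced_accuracy(labels, preds):
    """Mean of sensitivity and specificity for binary labels (1 = consistent)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(scores, labels, grid):
    """Pick the grid threshold that maximizes balanced accuracy (needs labels)."""
    return max(
        grid,
        key=lambda t: balanced_accuracy(labels, [int(s >= t) for s in scores]),
    )


# Synthetic scores, invented for illustration only.
# Dataset A: the scorer runs high, so a high cutoff separates the classes.
scores_a = [0.95, 0.90, 0.85, 0.75, 0.70, 0.60]
labels_a = [1, 1, 1, 0, 0, 0]
# Dataset B: the same scorer runs low, so a low cutoff is needed.
scores_b = [0.70, 0.55, 0.45, 0.35, 0.20, 0.10]
labels_b = [1, 1, 1, 0, 0, 0]

grid = [0.1 * k for k in range(1, 10)]
t_a = best_threshold(scores_a, labels_a, grid)  # tuned with A's labels
t_b = best_threshold(scores_b, labels_b, grid)  # "oracle" threshold for B

# Reusing A's threshold on B simulates having no labeled target data.
transfer = balanced_accuracy(labels_b, [int(s >= t_a) for s in scores_b])
oracle = balanced_accuracy(labels_b, [int(s >= t_b) for s in scores_b])
print(f"threshold A={t_a:.1f}, B={t_b:.1f}, transfer={transfer:.2f}, oracle={oracle:.2f}")
```

In this toy setup the threshold tuned on dataset A labels every summary in dataset B as inconsistent, so balanced accuracy falls to chance even though the scorer ranks B's summaries perfectly.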

Key Novelty

Ensembling diverse LLM prompts via weak supervision

Treat the outputs of multiple, diverse LLM prompts (each checking for factuality in different ways) as binary features
Feed these features into a lightweight ensemble model (like Snorkel's LabelModel) to aggregate predictions
Calibrate the final probability output to ensure the reported confidence matches empirical accuracy

Architecture

The complete DEEP framework pipeline: Prompts -> Binary Features -> Ensembler -> Calibrator

Evaluation Highlights

Achieves state-of-the-art balanced accuracy on AggreFact-XSUM FTSOTA (71.9%), TofuEval Summary-Level (79.4% on MeetingBank), and HaluEval Summarization (74.9%)
Ensembling just 3 prompts consistently yields performance improvements over the single best individual prompt across all datasets
Calibration using Platt Scaling reduces Expected Calibration Error (ECE) to under 6% for top models, significantly mitigating overconfidence

Breakthrough Assessment

8/10

Significantly outperforms encoder-based baselines in realistic settings (no test-set thresholding) and demonstrates that ensembling LLM prompts is a viable, superior alternative to fine-tuning for factuality detection.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of text summaries as factually consistent or inconsistent relative to a source document

Inputs: A source context (document) and a generated summary

Outputs: Probability that the summary is factually consistent (free of errors)

Pipeline Flow

Prompt Generation (N unique prompts generate binary scores)
Feature Vector Construction (Concatenate prompt outputs)
Ensembling (Combine features into one probability)
Calibration (Adjust probability to match empirical accuracy)
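The four stages above can be sketched end to end as follows. This is only an illustration of the data flow: the three "prompts" are hypothetical string heuristics standing in for LLM calls, and the weights and calibration parameters are hand-picked rather than learned.

```python
import math
from typing import Callable, List

# A prompt maps (document, summary) to 1 (consistent) or 0 (inconsistent).
Prompt = Callable[[str, str], int]


def build_features(prompts: List[Prompt], document: str, summary: str) -> List[int]:
    """Stages 1-2: run every prompt and concatenate the binary votes."""
    return [p(document, summary) for p in prompts]


def ensemble(features: List[int], weights: List[float], bias: float) -> float:
    """Stage 3: combine the votes into one raw score (a linear score here)."""
    return bias + sum(w * f for w, f in zip(weights, features))


def calibrate(score: float, a: float, b: float) -> float:
    """Stage 4: Platt-style sigmoid mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical stand-ins for LLM prompts; a real prompt would query the model.
def prompt_all_words_supported(document: str, summary: str) -> int:
    doc_words = document.lower().split()
    return int(all(w in doc_words for w in summary.lower().split()))


def prompt_no_new_numbers(document: str, summary: str) -> int:
    numbers = [w for w in summary.split() if w.isdigit()]
    return int(all(n in document.split() for n in numbers))


def prompt_always_consistent(document: str, summary: str) -> int:
    return 1  # an uninformative voter, to show why weighting matters


prompts = [prompt_all_words_supported, prompt_no_new_numbers, prompt_always_consistent]
weights, bias = [1.5, 1.0, 0.2], -1.5  # illustrative values, not learned
a, b = 2.0, 0.0                        # illustrative calibration parameters

doc = "the council approved 12 new buses on monday"
good = "the council approved 12 new buses"
bad = "the council rejected 15 new buses"

features_good = build_features(prompts, doc, good)
features_bad = build_features(prompts, doc, bad)
p_good = calibrate(ensemble(features_good, weights, bias), a, b)
p_bad = calibrate(ensemble(features_bad, weights, bias), a, b)
print(features_good, round(p_good, 3), features_bad, round(p_bad, 3))
```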

System Modules

Prompt Pool

Generate diverse binary judgments (consistent/inconsistent) using different reasoning strategies

Model or implementation: GPT-4-Turbo and GPT-3.5-Turbo

Ensembler

Aggregate binary prompt outputs into a single prediction

Model or implementation: Various (LabelModel, LogisticRegression, AdaBoost, etc.)
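As a rough illustration of what an ensembler over binary prompt votes does, the sketch below fits a naive-Bayes weighted vote: each prompt's sensitivity and specificity are estimated from labeled rows, and votes are combined in log-odds space. This is a simplified stand-in chosen for brevity, not one of the paper's ensemble methods and not Snorkel's LabelModel (which does not need labels). The data and function names are invented for illustration.

```python
import math


def fit_weighted_vote(X, y, smoothing=1.0):
    """Estimate per-prompt sensitivity/specificity and the class prior."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    params = []
    for j in range(len(X[0])):
        # Laplace-smoothed P(vote=1 | consistent) and P(vote=0 | inconsistent).
        sens = (sum(r[j] for r in pos) + smoothing) / (len(pos) + 2 * smoothing)
        spec = (sum(1 - r[j] for r in neg) + smoothing) / (len(neg) + 2 * smoothing)
        params.append((sens, spec))
    prior = (len(pos) + smoothing) / (len(X) + 2 * smoothing)
    return params, prior


def predict_proba(row, params, prior):
    """Posterior that the summary is consistent, assuming independent prompts."""
    log_odds = math.log(prior / (1 - prior))
    for vote, (sens, spec) in zip(row, params):
        if vote == 1:
            log_odds += math.log(sens / (1 - spec))
        else:
            log_odds += math.log((1 - sens) / spec)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Synthetic votes from three prompts (columns); prompt 0 is the most reliable.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

params, prior = fit_weighted_vote(X, y)
p_reliable_yes = predict_proba([1, 0, 0], params, prior)  # only prompt 0 says consistent
p_reliable_no = predict_proba([0, 1, 1], params, prior)   # only prompt 0 says inconsistent
print(round(p_reliable_yes, 3), round(p_reliable_no, 3))
```

A plain majority vote would side with the two noisier prompts in both cases; the weighted vote follows the more reliable prompt instead, which is the basic motivation for learning an ensemble rather than counting votes.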

Calibrator

Map uncalibrated scores to reliable probability estimates

Model or implementation: Platt Scaling (Logistic Regression on scores)
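A minimal sketch of Platt scaling, assuming the ensemble emits a raw score per summary: fit a one-dimensional logistic regression p = sigmoid(a * score + b) on held-out labeled scores. The data are synthetic, the gradient-descent fit is an illustrative choice, and the sketch omits the target smoothing used in Platt's original formulation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit a, b in p = sigmoid(a * score + b) by gradient descent on log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Synthetic, overconfident raw scores: +4 / -4 as log-odds claim about 98%
# confidence, but only 75% of the predictions in each group are correct.
scores = [4.0] * 8 + [-4.0] * 8
labels = [1, 1, 1, 1, 1, 1, 0, 0] + [0, 0, 0, 0, 0, 0, 1, 1]

a, b = fit_platt(scores, labels)
uncalibrated = sigmoid(4.0)
calibrated = sigmoid(a * 4.0 + b)
print(f"uncalibrated={uncalibrated:.3f}, calibrated={calibrated:.3f}")
```

After fitting, the reported probability for a score of +4 moves from about 0.98 toward the empirical accuracy of 0.75 in this toy data.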

Novel Architectural Elements

End-to-end framework treating LLM prompts as weak supervision labeling functions for a separate ensemble model
Application of post-hoc calibration (Platt Scaling) specifically to ensembled prompt outputs for factuality detection

Modeling

Base Model: GPT-4-Turbo (gpt-4-0125-preview) and GPT-3.5-Turbo (gpt-3.5-turbo-1106)

Training Method: Ensemble training (Logistic Regression, LabelModel, etc.) on fixed LLM outputs

Adaptation: None (LLMs are frozen; ensemble models are trained on LLM outputs)

Trainable Parameters: Parameters of the lightweight ensemble models (e.g., weights in Logistic Regression)

Training Data:

Ensemble parameters are learned on the 3 datasets not used for testing (leave-one-dataset-out strategy)
Feature selection with RFE and mRMR chooses the best prompt subsets
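The leave-one-out strategy described above can be sketched as a generator over datasets: each dataset is held out for testing in turn while the ensemble is fit on the rest. The dataset names and rows are placeholders, and the function name is hypothetical.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_rows, test_rows) for each dataset in turn."""
    for held_out in datasets:
        train = [
            row
            for name, rows in datasets.items()
            if name != held_out
            for row in rows
        ]
        yield held_out, train, datasets[held_out]


# Placeholder data: each row is (binary prompt votes, gold label).
datasets = {
    "dataset_a": [([1, 1, 0], 1), ([0, 0, 1], 0)],
    "dataset_b": [([1, 0, 1], 1), ([0, 1, 0], 0), ([1, 1, 1], 1)],
    "dataset_c": [([0, 0, 0], 0)],
    "dataset_d": [([1, 1, 1], 1), ([0, 1, 1], 0)],
}

folds = list(leave_one_dataset_out(datasets))
for held_out, train, test in folds:
    # An ensemble would be fit on `train` and scored on `test` here.
    print(held_out, len(train), len(test))
```

Because the held-out dataset never contributes to fitting, any threshold or weight the ensemble learns is chosen without access to labeled target data.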

Key Hyperparameters:

ensemble_methods: 16 methods tested (LabelModel, AdaBoost, etc.)
calibration_bins: 8 (for ECE calculation)
prompt_pool_sizes: 3, 5, 9 prompts
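For reference, ECE with 8 bins can be computed roughly as below. This sketch assumes equal-width bins over the predicted probability of the positive class and compares each bin's mean prediction with its observed positive rate; the paper's exact binning scheme may differ, and the example probabilities are synthetic.

```python
def expected_calibration_error(probs, labels, n_bins=8):
    """Weighted average gap between mean predicted probability and observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_pred - frac_pos)
    return ece


labels = [1, 1, 1, 0, 0, 0, 0, 1]
overconfident = [0.98] * 4 + [0.02] * 4  # claims 98% certainty, is right 75% of the time
calibrated = [0.75] * 4 + [0.25] * 4     # reported confidence matches accuracy

ece_over = expected_calibration_error(overconfident, labels)
ece_cal = expected_calibration_error(calibrated, labels)
print(f"overconfident ECE={ece_over:.2f}, calibrated ECE={ece_cal:.2f}")
```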

Compute: Requires N LLM API calls per summary (where N is number of prompts); significantly more resource-intensive than encoder models

Comparison to Prior Work

vs. Encoder Models (AlignScore, QAFactEval): DEEP does not require dataset-specific threshold tuning (which is unrealistic) to perform well
vs. Single LLM Prompts (ChatGPT-CoT): DEEP reduces variance and overconfidence by aggregating multiple diverse prompt perspectives
vs. Tang et al. (2024): DEEP uses an ensemble of prompts rather than a single prompt and applies calibration to fix overconfidence [cited in paper]

Limitations

Significantly more resource-intensive (computation/cost) than encoder models due to multiple LLM calls
Performance depends on the quality of the underlying LLM (tested primarily on GPT-4/3.5)
Does not currently support multilingual summarization error detection
Uncertainty about labeling errors in ground truth datasets (AggreFact/HaluEval)

Reproducibility

Code: https://github.com/AlexChandler/DEEP

Code and data are available on GitHub. The prompts are provided in the paper's appendix, and the specific LLM versions (gpt-4-0125-preview, gpt-3.5-turbo-1106) are specified.

📊 Experiments & Results

Evaluation Setup

Detecting factual errors in abstractive summaries generated by Transformers

Benchmarks:

AggreFact-XSUM FTSOTA: binary factual consistency classification
TofuEval Summary-Level: binary factual consistency classification
HaluEval Summarization: binary hallucination detection

Metrics:

Balanced Accuracy
Expected Calibration Error (ECE)
Statistical methodology: Bootstrap resampling with Bonferroni adjustment (p=0.01)
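One way to realize the statistical methodology listed above is a paired bootstrap over test items, with the significance level divided by the number of comparisons (Bonferroni). The sketch below is an assumption about the general procedure, not the paper's exact protocol. The predictions are synthetic, and the count of 3 comparisons is hypothetical.

```python
import random


def balanced_accuracy(labels, preds):
    """Average recall over the classes present in `labels`."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if idx:  # a bootstrap resample may miss a class entirely
            recalls.append(sum(preds[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def paired_bootstrap_pvalue(labels, preds_a, preds_b, n_resamples=2000, seed=0):
    """One-sided estimate: share of resamples in which A does not beat B."""
    rng = random.Random(seed)
    n = len(labels)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        y = [labels[i] for i in idx]
        ba_a = balanced_accuracy(y, [preds_a[i] for i in idx])
        ba_b = balanced_accuracy(y, [preds_b[i] for i in idx])
        if ba_a <= ba_b:
            not_better += 1
    return not_better / n_resamples


# Synthetic test set: 20 consistent and 20 inconsistent summaries.
labels = [1] * 20 + [0] * 20
preds_a = list(labels)
preds_b = list(labels)
for i in (0, 20):                                  # system A errs on 2 items
    preds_a[i] = 1 - preds_a[i]
for i in list(range(0, 7)) + list(range(20, 27)):  # system B errs on 14 items
    preds_b[i] = 1 - preds_b[i]

alpha = 0.01                        # significance level from the summary above
n_comparisons = 3                   # hypothetical number of comparisons
threshold = alpha / n_comparisons   # Bonferroni-adjusted level

p_value = paired_bootstrap_pvalue(labels, preds_a, preds_b)
significant = p_value < threshold
print(f"p={p_value:.4f}, adjusted alpha={threshold:.4f}, significant={significant}")
```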

Key Results

Comparison against State-of-the-Art Encoder Models (without oracle thresholding). The paper highlights that baselines perform poorly when thresholds aren't tuned on the test set.

Benchmark	Metric	Baseline	This Paper	Δ
AggreFact-XSUM FTSOTA	Balanced Accuracy (%)	70.2	71.9	+1.7
HaluEval Summarization	Balanced Accuracy (%)	64.3	74.9	+10.6
TofuEval MediaSum Summary-Level	Balanced Accuracy (%)	63.6	66.3	+2.7
TofuEval MeetingBank Summary-Level	Balanced Accuracy (%)	72.9	79.4	+6.5

Calibration results demonstrating the effectiveness of Platt Scaling.

Benchmark	Metric	Baseline	This Paper	Δ
AggreFact-XSUM FTSOTA	ECE (Expected Calibration Error, %)	23.8	4.7	-19.1

Experiment Figures

Optimal thresholds for factual consistency models across different datasets

Reliability diagrams comparing uncalibrated vs. calibrated predictions

Main Takeaways

Encoder-based models (AlignScore, QAFactEval) are highly sensitive to threshold selection; their performance drops significantly when thresholds are not tuned on the target test set (a realistic constraint)
Ensembling as few as 3 diverse LLM prompts consistently outperforms single-prompt baselines and encoder models across multiple datasets
Calibration (specifically Platt Scaling) effectively mitigates the inherent overconfidence of LLM binary predictions, reducing ECE to <7%
Snorkel's LabelModel is frequently the top performing ensemble method, likely due to its ability to handle noisy weak supervision signals

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting
Familiarity with ensemble learning methods
Knowledge of calibration metrics like Expected Calibration Error (ECE)

Key Terms

Hallucinations: Instances where a model generates plausible but fabricated information not supported by the source

Factual Inconsistency: Generated text that contradicts the source material or established facts

Ensemble Learning: Merging outputs of multiple models (in this case, prompts) to produce a more accurate prediction

Balanced Accuracy: The arithmetic mean of sensitivity and specificity, used to evaluate performance on imbalanced datasets

ECE: Expected Calibration Error—a metric measuring the difference between a model's predicted confidence and its actual accuracy

Platt Scaling: A parametric calibration method that applies a logistic regression to model outputs to produce calibrated probabilities

LabelModel: A method from the Snorkel framework that learns conditional probabilities of noisy labeling functions (prompts) to reweight their outputs without ground truth data

Chain of Thought (CoT): A prompting technique where the model produces intermediate reasoning steps before the final answer

Weak Supervision: Using noisy, limited, or imprecise sources (like heuristics or prompts) to label training data

RFE: Recursive Feature Elimination—a feature selection technique that recursively removes the least important features

mRMR: Minimum Redundancy Maximum Relevance—a feature selection method maximizing relevance to the target while minimizing redundancy among features