MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

📝 Paper Summary

Hallucination suppression · Fact-checking / Verification · Knowledge internalization (via synthetic data)

MiniCheck is a small (770M parameter) fact-checking model that matches GPT-4 performance at 400x lower cost by training on novel synthetic data requiring multi-sentence reasoning.

Core Problem

Verifying LLM outputs against evidence is computationally expensive using large models like GPT-4, while smaller specialized models struggle with complex, multi-fact reasoning.

Why it matters:

Current methods like self-verification with LLMs are too costly (e.g., checking 40 facts against 5 documents results in 200+ checks per response)
Specialized smaller models often fail to recognize when a claim aggregates information across multiple sentences or contains multiple atomic facts
Existing datasets (MNLI, ANLI) do not reflect the complexity of modern LLM hallucination patterns
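
The check count in the first point is a simple product of facts and documents. The sketch below works through that arithmetic; the function name and the normalized cost unit are illustrative assumptions, while the 40-fact, 5-document figures and the 400x ratio come from this summary.

```python
def num_checks(num_facts: int, num_documents: int) -> int:
    """Every fact is checked against every document once."""
    return num_facts * num_documents


checks = num_checks(num_facts=40, num_documents=5)

# Assumption: cost is normalized so that one GPT-4 check costs 1.0 unit.
gpt4_cost = checks * 1.0
# The summary reports a roughly 400x lower cost for MiniCheck.
minicheck_cost = gpt4_cost / 400

print(checks, gpt4_cost, minicheck_cost)
```

Because the number of checks grows with both the number of facts and the number of documents, the per-check cost dominates the total bill for each response.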

Concrete Example: An LLM claims 'Two people argue about weather and labor issues.' The document contains two separate sentences: one person mentions weather, another mentions labor. A standard entailment model might incorrectly flag this as unsupported because no single sentence contains both topics, whereas MiniCheck correctly aggregates evidence across sentences.

Key Novelty

Synthetic Data for Complex Fact-Checking (C2D & D2C)

Generates synthetic training data where verifying a claim requires combining information from multiple sentences (Claim-to-Doc and Doc-to-Claim methods)
Constructs 'hard negatives' by removing specific sentences from the evidence that support only part of a complex claim, forcing the model to verify *all* atomic facts
Unifies 10 existing datasets into a new benchmark (LLM-AggreFact) to evaluate fact-checking across diverse grounding settings
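
The hard-negative idea above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the `Example` class, the `build_examples` helper, and the hand-written `support` mapping are all hypothetical (in the paper, GPT-4 generates the claims, atomic facts, and documents).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    document: str
    claim: str
    label: int  # 1 = supported, 0 = unsupported


def build_examples(claim: str, sentences: List[str],
                   support: Dict[str, List[int]]) -> List[Example]:
    """support maps each atomic fact to the indices of ALL sentences backing it."""
    # The full document supports every atomic fact, so the claim is supported.
    examples = [Example(" ".join(sentences), claim, 1)]
    for fact, idxs in support.items():
        # Drop the evidence for one atomic fact: the claim is now only
        # partially supported, which makes the pair a hard negative.
        kept = [s for i, s in enumerate(sentences) if i not in idxs]
        examples.append(Example(" ".join(kept), claim, 0))
    return examples


sentences = [
    "Ann complains that the weather has been terrible.",
    "Bob says the labor dispute is getting worse.",
    "They are sitting in a cafe.",
]
claim = "Two people argue about weather and labor issues."
support = {"weather is discussed": [0], "labor issues are discussed": [1]}

examples = build_examples(claim, sentences, support)
for ex in examples:
    print(ex.label, "|", ex.document)
```

Each negative still contains evidence for the other atomic fact, so a model cannot score well by matching only part of the claim.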

Architecture

The synthetic data generation process (Claim-to-Doc and Doc-to-Claim).

Evaluation Highlights

MiniCheck-FT5 (770M params) reaches GPT-4 performance levels on the LLM-AggreFact benchmark while being 400x cheaper
Outperforms AlignScore-Large (355M params), the previous state-of-the-art specialized model, by ~4-10% accuracy
Demonstrates that decomposing sentences into atomic facts is unnecessary for high performance when using MiniCheck

Breakthrough Assessment

9/10

Achieving GPT-4 level performance with a 770M parameter model on a critical task like hallucination detection is a significant efficiency breakthrough. The synthetic data methodology is highly generalizable.

⚙️ Technical Details

Problem Definition

Setting: Document-grounded fact-checking where a claim (sentence) must be classified as supported or unsupported by a set of grounding documents

Inputs: A claim sentence c and a set of grounding documents D

Outputs: Binary label (0 for unsupported, 1 for supported)

Pipeline Flow

Input Processing (Claim c, Document D)
MiniCheck Model (Classifies Entailment)
Output (Supported/Unsupported)
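
A minimal sketch of this flow, assuming a pluggable scoring function rather than the released model. The names `check_claim` and `toy_scorer`, the 0.5 threshold, and the rule that a claim counts as supported when any single document supports it are assumptions made for illustration, not details taken from the paper.

```python
from typing import Callable, Sequence

# (document, claim) -> probability that the document supports the claim
Scorer = Callable[[str, str], float]


def check_claim(claim: str, documents: Sequence[str], scorer: Scorer,
                threshold: float = 0.5) -> int:
    """Return 1 (supported) or 0 (unsupported) for a claim against a document set."""
    if not documents:
        return 0
    # Assumption: the best-scoring document decides the label.
    best = max(scorer(doc, claim) for doc in documents)
    return int(best >= threshold)


def toy_scorer(document: str, claim: str) -> float:
    """Stand-in for a trained model: fraction of claim words found in the document."""
    claim_words = {w.strip(".,").lower() for w in claim.split()}
    doc_words = {w.strip(".,").lower() for w in document.split()}
    if not claim_words:
        return 0.0
    return len(claim_words & doc_words) / len(claim_words)


docs = ["The bridge opened in 1932.", "Tickets cost five dollars."]
print(check_claim("The bridge opened in 1932.", docs, toy_scorer))
print(check_claim("The tower closed in 2001.", docs, toy_scorer))
```

In practice `toy_scorer` would be replaced by a call to a fine-tuned classifier; keeping the scorer as a parameter makes the surrounding pipeline independent of the model choice.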

System Modules

MiniCheck Model

Predicts whether the document supports the claim

Model or implementation: Flan-T5-Large (770M parameters), DeBERTa-V3-Large, or RoBERTa-Large

Novel Architectural Elements

Does not introduce a new model architecture. The novelty is the synthetic data pipeline (C2D/D2C), which teaches small models multi-hop reasoning without explicit claim decomposition at inference time

Modeling

Base Model: Flan-T5-Large (770M params), DeBERTa-V3-Large (435M params), RoBERTa-Large (355M params)

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize classification error between predicted entailment and synthetic ground truth.

Formally: Standard cross-entropy loss.
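
For the binary supported/unsupported setting defined earlier, standard cross-entropy can be written as below. The notation ($N$ training examples, label $y_i \in \{0, 1\}$, model parameters $\theta$) is introduced here for illustration and is not taken from the paper.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_i \log p_\theta(y = 1 \mid D_i, c_i)
      + (1 - y_i) \log \big( 1 - p_\theta(y = 1 \mid D_i, c_i) \big) \Big]
```

Here $p_\theta(y = 1 \mid D_i, c_i)$ is the predicted probability that document $D_i$ supports claim $c_i$.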

Training Data:

14K synthetic examples generated via C2D and D2C methods using GPT-4
21K examples from ANLI (Adversarial NLI) dataset
Total training size: ~35K examples

Key Hyperparameters:

learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Inference cost is ~400x lower than GPT-4 (based on pricing/token analysis). Training compute details not explicitly reported.

Comparison to Prior Work

vs. AlignScore: MiniCheck is trained on targeted synthetic 'hard negatives' requiring multi-sentence reasoning, whereas AlignScore uses naturally occurring data which may be easier.
vs. GPT-4: MiniCheck achieves similar accuracy with a much smaller model (770M parameters; GPT-4's parameter count is not publicly disclosed) and at roughly 400x lower inference cost.
vs. FactScore (decomposition-based): MiniCheck shows that explicit decomposition (breaking sentences into atoms) is not necessary for high performance if the model is trained on data that requires implicit decomposition.
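
The difference between the two inference strategies in the last point can be sketched as follows. Everything here is illustrative: `CountingScorer` is a toy word-overlap stand-in for a real model, the atomic facts are written by hand rather than produced by a decomposition step, and the 0.5 threshold is an assumption.

```python
from typing import Sequence


class CountingScorer:
    """Toy stand-in for a fact-checking model that also counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, document: str, claim: str) -> float:
        self.calls += 1
        doc_words = {w.strip(".,").lower() for w in document.split()}
        # Only consider longer words, to skip most function words.
        content = [w.strip(".,").lower() for w in claim.split()
                   if len(w.strip(".,")) >= 4]
        if not content:
            return 0.0
        return sum(w in doc_words for w in content) / len(content)


def check_with_decomposition(atomic_facts: Sequence[str], document: str,
                             scorer, threshold: float = 0.5) -> int:
    """Supported only if every atomic fact is supported (up to one call per fact)."""
    return int(all(scorer(document, fact) >= threshold for fact in atomic_facts))


def check_direct(claim: str, document: str, scorer,
                 threshold: float = 0.5) -> int:
    """Score the whole claim in a single call."""
    return int(scorer(document, claim) >= threshold)


document = "Ann says the weather is awful. Bob says the labor dispute is worse."
claim = "Ann says the weather is awful and Bob says the labor dispute is worse."
facts = ["Ann says the weather is awful.", "Bob says the labor dispute is worse."]

decomp_scorer, direct_scorer = CountingScorer(), CountingScorer()
decomposed_label = check_with_decomposition(facts, document, decomp_scorer)
direct_label = check_direct(claim, document, direct_scorer)
print(decomposed_label, decomp_scorer.calls, direct_label, direct_scorer.calls)
```

The decomposition route needs a separate splitting step plus one model call per atomic fact, whereas the direct route makes a single call per claim; the summary's point is that a model trained on suitably hard data can take the direct route without losing accuracy.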

Limitations

Synthetic data generation relies on GPT-4, which has a cost associated with dataset creation
Performance depends on the quality of the grounding documents provided (garbage in, garbage out)
Does not explicitly handle long-context documents where evidence is scattered across very large distances (beyond standard context windows)
No statistical significance tests reported for the improvement margins

Reproducibility

Code: https://github.com/Liyan06/MiniCheck

Code, synthetic data, and models are publicly released at https://github.com/Liyan06/MiniCheck. The paper details the prompts used for GPT-4 synthetic data generation in the Appendix.

📊 Experiments & Results

Evaluation Setup

Unified benchmark of 10 datasets covering summarization, dialogue, and QA

Benchmarks:

LLM-AggreFact (Factual Consistency Checking) [New]

Metrics:

Balanced Accuracy (BAcc)
Pearson Correlation
GPT-4 Estimation Cost

Statistical methodology: Not explicitly reported in the paper
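
Balanced accuracy, the first metric listed above, is the mean of the true-positive rate and the true-negative rate, so it is not inflated when one label dominates. A minimal sketch with hypothetical labels (the helper name and the example data are illustrative, not from the paper):

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the supported (1) and unsupported (0) classes."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("balanced accuracy needs both classes in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical, imbalanced labels: 2 supported claims, 6 unsupported ones.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

bacc = balanced_accuracy(y_true, y_pred)
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(bacc, 4), plain_acc)
```

On this toy data plain accuracy is higher than balanced accuracy because the majority class is mostly predicted correctly while half of the supported claims are missed.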

Key Results

MiniCheck-FT5 outperforms comparable baselines and matches GPT-4 on the aggregated LLM-AggreFact benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
LLM-AggreFact	Balanced Accuracy	73.9	78.4	+4.5
LLM-AggreFact	Balanced Accuracy	77.7	78.4	+0.7
LLM-AggreFact	Relative cost (baseline = 1.0)	1.0	0.0025	-0.9975

Experiment Figures

Comparison of Accuracy vs. Cost for various models on LLM-AggreFact.

Main Takeaways

MiniCheck-FT5 generalizes well across different domains (summarization, QA, dialogue) despite being trained only on synthetic data and ANLI.
The synthetic data strategy (C2D/D2C) is critical; it forces the model to learn 'implicit decomposition'—verifying all parts of a complex sentence without needing an external splitter.
Decomposition-based evaluation (breaking claims into atoms first) does not improve MiniCheck's performance, suggesting the model already internalizes this process.
Small models can effectively replace large LLMs for the specific primitive of grounding verification if trained on high-quality, structurally difficult synthetic data.

📚 Prerequisite Knowledge

Prerequisites

Natural Language Inference (NLI) / Textual Entailment
Instruction Tuning / Fine-tuning
Synthetic Data Generation with LLMs

Key Terms

Grounding documents: External text (like retrieved passages) used as evidence to verify a model's generated response

Atomic facts: The smallest indivisible units of information within a sentence (e.g., 'Obama was born in Hawaii' is one fact; 'Obama, born in Hawaii, was President' has two)

C2D (Claim-to-Doc): A synthetic data generation method that starts with a claim, decomposes it, and generates documents that support or refute specific parts of it

D2C (Doc-to-Claim): A synthetic data generation method that starts with a document chunk, summarizes it, and creates variations to test entailment

LLM-AggreFact: A new benchmark aggregation introduced in this paper, combining 10 existing datasets for evaluating factual consistency

Decontextualization: Rewriting a sentence so it stands alone without surrounding context (e.g., resolving pronouns like 'he' to 'Obama')

SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples

NLI: Natural Language Inference, the task of deciding whether a premise entails, contradicts, or is neutral toward a hypothesis