RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models

📝 Paper Summary

Hallucination detection, Factuality evaluation

RefChecker detects hallucinations by decomposing LLM responses into knowledge triplets and verifying each against references, significantly improving detection accuracy over sentence-level methods.

Core Problem

Existing hallucination detection methods operate at coarse granularities (response or sentence level), which often miss subtle fabrications or get confused by complex sentences containing both facts and errors.

Why it matters:

Long responses from models such as Llama 2 (about 150 tokens on average) mix true and false statements, so coarse checks produce false negatives
Sub-sentence phrase extraction is structurally ill-defined, making it hard to create high-quality few-shot demonstrations for LLM-based evaluators
Current benchmarks lack diversity across real-world settings like zero-context generation vs. RAG (Noisy Context)

Concrete Example: If an LLM says 'Apple released the iPhone 15 in 2021,' a sentence-level checker can only assign one label to the whole sentence, even though the company and product are right and only the year is wrong. RefChecker extracts the triplet (iPhone 15, released_in, 2021), which can be verified precisely against a reference that states the correct year.

Key Novelty

Claim-Triplet Granularity for Verification

Decomposes text into (Subject, Relation, Object) triplets rather than sentences or arbitrary phrases, providing structured units that are easier to verify
Introduces a three-way classification (Entailment, Neutral, Contradiction) to handle unverifiable claims, rather than just binary factual/non-factual
Formalizes three distinct evaluation settings: Zero Context (closed-book), Noisy Context (RAG), and Accurate Context (summarization/IE)

Architecture

The RefChecker pipeline flow from input response to final hallucination assessment.

Evaluation Highlights

RefChecker (Claude 2 + GPT-4) outperforms the best prior method (FacTool) by 6.8 to 26.1 points in correlation with human judgment
Checking at the triplet level improves detection performance by 4 to 9 points compared to response, sentence, or sub-sentence granularities
Using a fine-tuned open-source model (Mistral 7B) as a checker achieves strong correlation with human annotations, offering a cost-effective alternative to proprietary models

Breakthrough Assessment

8/10

Significantly refines the unit of analysis for hallucination. The shift to triplets is intuitively sound and empirically validated as superior to sentence-level checks. The release of a fine-grained benchmark with 11k annotations is a major resource contribution.

⚙️ Technical Details

Problem Definition

Setting: Given a response R and a reference context C (or internal knowledge), identify the specific claims in R that contradict C

Inputs: LLM Response R, Reference Context C (optional/retrieved)

Outputs: Set of claim-triplets T = {(h, r, t)} and their labels (Entailment, Neutral, Contradiction)
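
The output format above maps naturally onto a small typed structure. The following is a minimal sketch, not the released library's data model: the names `Label`, `ClaimTriplet`, and `CheckedClaim` are hypothetical, and the example triplet is the iPhone illustration used earlier in this summary.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """Three-way verdict for one claim, following the summary's label set."""
    ENTAILMENT = "Entailment"
    NEUTRAL = "Neutral"
    CONTRADICTION = "Contradiction"


@dataclass(frozen=True)
class ClaimTriplet:
    """One (h, r, t) claim: head entity, relation, tail entity."""
    head: str
    relation: str
    tail: str


@dataclass(frozen=True)
class CheckedClaim:
    """A triplet paired with the checker's verdict (hypothetical container)."""
    triplet: ClaimTriplet
    label: Label


claim = ClaimTriplet("iPhone 15", "released_in", "2021")
checked = CheckedClaim(claim, Label.CONTRADICTION)
print(checked.triplet.tail, checked.label.value)
```

Freezing the dataclasses makes triplets hashable, so duplicate claims extracted from one response can be collapsed with a `set`.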

Pipeline Flow

Group: Response Processing: LLM Response → Extractor → [Claim-Triplets]
Group: Verification: [Claim-Triplets] + Reference → Checker → [Labels (Entailment/Neutral/Contradiction)]
Group: Aggregation: [Labels] → Aggregator → Final Hallucination Score
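
The three groups above compose into a single chain of calls. The sketch below wires them together with placeholder components: `toy_extract`, `toy_check`, and `toy_aggregate` are hypothetical stand-ins (string splitting and substring matching), not the paper's LLM-based Extractor and Checker, and the reference sentence is supplied only for illustration.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]            # (head, relation, tail)
Extractor = Callable[[str], List[Triplet]]
Checker = Callable[[Triplet, str], str]   # returns a label string
Aggregator = Callable[[List[str]], float]


def run_pipeline(response: str, reference: str,
                 extract: Extractor, check: Checker,
                 aggregate: Aggregator) -> Tuple[List[Tuple[Triplet, str]], float]:
    """Extractor -> Checker -> Aggregator, mirroring the three groups above."""
    triplets = extract(response)
    labels = [check(t, reference) for t in triplets]
    return list(zip(triplets, labels)), aggregate(labels)


# --- Toy stand-ins (assumptions, NOT the paper's models) --------------------
def toy_extract(response: str) -> List[Triplet]:
    # Pretend extractor: expects "head|relation|tail" claims separated by ';'.
    return [tuple(part.strip() for part in claim.split("|"))
            for claim in response.split(";") if claim.strip()]


def toy_check(triplet: Triplet, reference: str) -> str:
    # Pretend checker: substring matching instead of an NLI model.
    head, _, tail = triplet
    if head not in reference:
        return "Neutral"          # the reference says nothing about the head
    return "Entailment" if tail in reference else "Contradiction"


def toy_aggregate(labels: List[str]) -> float:
    # Fraction of claims not entailed by the reference (0.0 if no claims).
    if not labels:
        return 0.0
    return sum(label != "Entailment" for label in labels) / len(labels)


reference = "Apple released the iPhone 15 in 2023."
response = "iPhone 15|released_in|2021; iPhone 15|made_by|Apple"
claims, score = run_pipeline(response, reference,
                             toy_extract, toy_check, toy_aggregate)
for triplet, label in claims:
    print(triplet, label)
print("score:", score)
```

Passing the three stages as callables keeps extraction and verification decoupled, so either stage can be swapped (for example, a proprietary model for a fine-tuned open one) without touching the rest.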

System Modules

Extractor

Decompose the response into a list of atomic knowledge triplets

Model or implementation: Fine-tuned Mistral 7B (via knowledge distillation from Mixtral 8x7B) OR GPT-4/Claude 2

Checker

Verify each triplet against the reference text

Model or implementation: Mistral 7B (RepC or LoRA) OR GPT-4/Claude 2 OR RoBERTa-NLI

Aggregator

Combine triplet labels into a response-level score

Model or implementation: Rule-based
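
This summary says only that the Aggregator is rule-based and does not state the rule. One plausible rule, shown here purely as an assumption, is "worst label wins": a single contradicted triplet marks the whole response as a contradiction. The function names and the severity ordering are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Severity order is an assumption: Contradiction outranks Neutral,
# which outranks Entailment.
SEVERITY: Dict[str, int] = {"Entailment": 0, "Neutral": 1, "Contradiction": 2}


def aggregate_strict(labels: List[str]) -> str:
    """Return the most severe label among the triplets ("worst label wins")."""
    if not labels:
        raise ValueError("no triplet labels to aggregate")
    return max(labels, key=SEVERITY.__getitem__)


def label_rates(labels: List[str]) -> Dict[str, float]:
    """Fraction of triplets carrying each label, as a softer summary."""
    if not labels:
        raise ValueError("no triplet labels to aggregate")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in SEVERITY}


labels = ["Entailment", "Neutral", "Entailment", "Contradiction"]
overall = aggregate_strict(labels)
rates = label_rates(labels)
print(overall, rates)
```

The strict rule yields a categorical response-level label, while the per-label rates give a graded score that is better suited to correlation with human judgment.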

Novel Architectural Elements

Two-stage pipeline explicitly decoupling extraction (via triplets) from verification
Use of Representation-based Classifiers (RepC) on open-source LLM hidden states for NLI checking
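
RepC is described here only as a shallow classifier over an LLM's hidden states. The sketch below illustrates that idea with a deliberately simpler stand-in: a nearest-centroid classifier over hand-made 3-dimensional vectors that play the role of hidden states. The vectors, the `fit_centroids` and `predict` helpers, and the choice of nearest-centroid (rather than a classifier such as an SVM or MLP) are all assumptions for illustration.

```python
from math import dist
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(features: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each class (the whole 'training' step)."""
    grouped: Dict[str, List[Vector]] = {}
    for vec, label in zip(features, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def predict(centroids: Dict[str, List[float]], vec: Vector) -> str:
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


# Made-up stand-ins for hidden states of (triplet, reference) pairs.
train_x = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1),   # Entailment
           (0.1, 0.9, 0.1), (0.2, 0.8, 0.0),   # Neutral
           (0.0, 0.1, 0.9), (0.1, 0.0, 0.8)]   # Contradiction
train_y = ["Entailment", "Entailment", "Neutral", "Neutral",
           "Contradiction", "Contradiction"]

centroids = fit_centroids(train_x, train_y)
prediction = predict(centroids, (0.05, 0.2, 0.7))
print(prediction)
```

In a real setup the feature vectors would come from a frozen LLM's hidden states, so only the small classifier on top needs training.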

Modeling

Base Model: Mistral 7B (for the open-source Extractor and Checker variants)

Training Method: Supervised Fine-Tuning (SFT) for Extractor; LoRA or Representation Classification (RepC) for Checker

Adaptation: LoRA (Low-Rank Adaptation) used for one variant of the Checker; Knowledge Distillation used for the Extractor

Trainable Parameters: Small low-rank adapter matrices (LoRA) or shallow classifier heads (RepC)

Training Data:

Extractor trained on 10k responses distilled from Mixtral 8x7B
Checker fine-tuned using NLI data

Compute: Not reported in the paper
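
To make the LoRA variant's trainable-parameter footprint concrete: LoRA keeps a pretrained d × k weight matrix frozen and learns a low-rank update built from a d × r matrix and an r × k matrix, so each adapted matrix contributes r(d + k) trainable parameters instead of dk. The arithmetic below uses illustrative sizes (a 4096 × 4096 matrix and rank 8) that are assumptions, not settings reported in this summary.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k matrix.

    Full fine-tuning updates all d*k entries; LoRA trains B (d x r) and
    A (r x k) while the original matrix stays frozen.
    """
    full = d * k
    lora = r * (d + k)
    return full, lora


# Illustrative sizes only (assumed, not taken from the paper).
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```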

Comparison to Prior Work

vs. SelfCheckGPT: RefChecker uses explicit reference checking rather than stochastic self-consistency [not cited in paper]
vs. FActScore: Uses structured triplets instead of unstructured 'atomic facts' (sub-sentences), enabling better definition and verification
vs. FacTool: Distinguishes 'Neutral' (unverifiable) from 'Contradiction', whereas FacTool is binary factual/non-factual
vs. Granularity: Shows via ablation studies that triplet-level checking outperforms sentence and sub-sentence level checking

Limitations

Does not consider unmentioned aspects (recall-oriented hallucination), focusing only on precision of generated content
Triplet extraction can lose some nuance compared to full natural language sentences
Dependency on the quality of the reference text (if the reference is wrong, the check is wrong)
Computational cost of two-step extraction and verification is higher than simple response-level classification

Reproducibility

Code: https://github.com/amazon-science/RefChecker

Code and data are publicly available at https://github.com/amazon-science/RefChecker. The benchmark includes 11k annotated triplets. The open-source extractor (Mistral-based) is provided.

📊 Experiments & Results

Evaluation Setup

Hallucination detection across three settings: Zero Context (closed-book QA), Noisy Context (RAG with noisy retrieval), and Accurate Context (summarization/IE)

Benchmarks:

RefChecker Benchmark (Hallucination Detection) [New]

Metrics:

Correlation with human judgment (Pearson and Spearman)
Macro-F1 score (for granularity comparison)
Statistical methodology: Pearson and Spearman correlation coefficients calculated between model predictions and human annotations
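
For readers who want the two correlation metrics spelled out, the following is a minimal pure-Python sketch. The `human` and `model` scores are made up for illustration and are not data from the paper; the helper names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up hallucination scores for five responses (not from the paper).
human = [0.0, 0.2, 0.2, 0.5, 1.0]
model = [0.1, 0.3, 0.2, 0.6, 0.9]
print(f"pearson={pearson(human, model):.3f}  spearman={spearman(human, model):.3f}")
```

Pearson rewards a linear relationship between the scores, while Spearman depends only on their ordering, which is why reporting both gives a fuller picture of agreement with annotators.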

Key Results

Comparison of RefChecker against baselines using Pearson correlation with human judgment across three context settings.
Benchmark	Metric	Baseline	This Paper	Δ
RefChecker Benchmark (Zero Context)	Pearson Correlation	57.7	81.6	+23.9
RefChecker Benchmark (Noisy Context)	Pearson Correlation	58.1	84.2	+26.1
RefChecker Benchmark (Accurate Context)	Pearson Correlation	45.0	69.5	+24.5

Ablation study showing the impact of claim granularity on detection performance.
Benchmark	Metric	Baseline	This Paper	Δ
RefChecker Benchmark (Aggregated)	Macro-F1	73	79	+6
RefChecker Benchmark (Aggregated)	Macro-F1	67	79	+12
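
The granularity ablation is reported in Macro-F1, which averages per-class F1 so that each of the three labels counts equally regardless of how often it occurs. The sketch below computes it on made-up gold and predicted labels; the data, the `macro_f1` helper, and the convention of scoring an absent class as 0.0 are assumptions for illustration.

```python
from typing import List, Sequence

CLASSES = ("Entailment", "Neutral", "Contradiction")


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             classes: Sequence[str] = CLASSES) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores: List[float] = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no gold or predicted
        # instances is scored 0.0 here (an assumed convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for six triplets (not data from the paper).
gold = ["Entailment", "Entailment", "Neutral",
        "Contradiction", "Contradiction", "Neutral"]
pred = ["Entailment", "Neutral", "Neutral",
        "Contradiction", "Entailment", "Neutral"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```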

Experiment Figures

Comparison of hallucination detection performance (Macro-F1) across different checking granularities (Response, Sentence, Sub-sentence, Triplet).

Heatmap of Spearman correlations for various Extractor + Checker combinations against human judgment.

Main Takeaways

Triplet-level checking consistently outperforms sentence, sub-sentence, and response-level checking, offering the best balance of granularity and context.
Context quality significantly impacts hallucination rates: contradictions drop from 25% (Zero Context) to 13% (Noisy Context) to 6% (Accurate Context).
Open-source checkers (Mistral-RepC) are competitive with proprietary models (GPT-4), providing a viable path for privacy-preserving or cost-sensitive applications.
Sub-sentence splitting (used in FActScore) can sometimes degrade performance compared to sentence-level because phrases become structurally ambiguous or lose necessary context.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (triplet structure)
Natural Language Inference (NLI)
Retrieval-Augmented Generation (RAG)

Key Terms

Claim-triplet: A factual claim represented as a (head_entity, relation, tail_entity) structure extracted from natural language text

Zero Context (ZC): A setting where the LLM answers based solely on internal memory without external documents

Noisy Context (NC): A RAG setting where the LLM answers based on retrieved documents that may contain irrelevant or noisy information

Accurate Context (AC): A setting where the provided reference text is assumed to be correct and relevant (e.g., summarization tasks)

NLI: Natural Language Inference, the task of determining whether a premise supports a hypothesis (entailment), contradicts it (contradiction), or does neither (neutral)

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers

RepC: Representation-based Classifier—a checker that uses a shallow classifier (like SVM or MLP) on top of an LLM's internal hidden states

RAG: Retrieval-Augmented Generation—providing external documents to an LLM to ground its answers