Knowledge Verification to Nip Hallucination in the Bud

📝 Paper Summary

Hallucination mitigation · Data curation for alignment · Knowledge consistency

The paper proposes KCA, a method to mitigate hallucinations by detecting alignment data requiring knowledge the model lacks and then applying strategies like open-book tuning or refusal to resolve the inconsistency.

Core Problem

Fine-tuning foundation models on alignment data containing external knowledge absent from their pretraining corpus creates 'knowledge inconsistency,' leading models to hallucinate plausible but factually incorrect responses.

Why it matters:

Models forced to align with data they don't 'know' (from pretraining) often fabricate information to satisfy the instruction
Existing methods for detecting a model's knowledge boundaries rely on simple QA accuracy, so they fail on complex tasks (e.g., long-form generation) or lack interpretability
Blindly fine-tuning on inconsistent data degrades truthfulness across diverse benchmarks like medical reporting and RAG

Concrete Example: A foundation model's pretraining corpus lacks information on 'Direct Preference Optimization' (DPO). If the alignment dataset includes a question asking to explain DPO, the model may hallucinate an incorrect explanation during fine-tuning because it lacks the intrinsic knowledge, yet tries to mimic the helpful response style.

Key Novelty

Knowledge Consistent Alignment (KCA)

Uses a well-aligned helper model (e.g., GPT-3.5) to generate multiple-choice exams based on the specific knowledge required by an instruction, testing if the foundation model actually 'knows' the topic
Instead of just filtering data, it applies three specific calibration strategies (Open-book, Discard, Refusal) to handle identified knowledge gaps before fine-tuning
Explicitly generates and utilizes reference knowledge snippets to attribute inconsistencies, rather than relying solely on model confidence scores
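
The three calibration strategies can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `Example` record, the `calibrate` function, the prompt layout used for open-book tuning, and the refusal wording are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical refusal text; the wording used in the paper may differ.
REFUSAL_TEXT = "Sorry, I don't have enough knowledge to answer this reliably."


@dataclass(frozen=True)
class Example:
    instruction: str
    response: str
    knowledge: str = ""         # reference snippet produced by the helper model
    inconsistent: bool = False  # True if the foundation model failed the exam


def calibrate(example: Example, strategy: str) -> Optional[Example]:
    """Return the calibrated example, or None if it should be dropped."""
    if not example.inconsistent:
        return example  # consistent data is left untouched
    if strategy == "open_book":
        # Attach the missing knowledge to the instruction (layout is assumed).
        prompt = f"{example.instruction}\n\nReference knowledge:\n{example.knowledge}"
        return replace(example, instruction=prompt)
    if strategy == "discard":
        return None  # remove the instance from the training set
    if strategy == "refusal":
        # Keep the instruction but train the model to decline.
        return replace(example, response=REFUSAL_TEXT)
    raise ValueError(f"unknown strategy: {strategy}")
```

Only instances flagged as inconsistent are modified, so the three strategies differ solely in how they treat that subset.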

Architecture

The overall workflow of Knowledge Consistent Alignment (KCA).

Evaluation Highlights

Improves TruthfulQA MC1 by roughly 5 to 8 points over standard instruction tuning on Llama-2-7B and Mistral-7B (see Key Results)
Refusal tuning strategy achieves the lowest hallucination rates, while open-book tuning best maintains helpfulness
Consistent improvements across 6 diverse benchmarks including general instruction following, RAG, and clinical report generation

Breakthrough Assessment

7/10

A solid data-centric approach to hallucination. Shifting the focus from model architecture to ensuring 'knowledge consistency' in training data is practical and effective, though the reliance on a strong teacher model for exam generation is a constraint.

⚙️ Technical Details

Problem Definition

Setting: Instruction tuning of foundation LLMs where alignment data contains external knowledge not present in the pretraining corpus

Inputs: Alignment dataset D = {(Instruction, Response)}, Foundation Model M

Outputs: Fine-tuned Model M' with reduced hallucination tendencies

Pipeline Flow

Knowledge Requirement Classification (Helper Model G)
Reference Knowledge Generation (Helper Model G)
Examination Formulation (Helper Model G)
Examination Completion (Foundation Model M)
Inconsistency Calibration (Data Modification)
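
The first four stages above can be sketched as a single partitioning pass over the alignment data. The callables (`needs_knowledge`, `gen_knowledge`, `make_exam`, `exam_score`) stand in for prompts to helper model G and foundation model M; their names and signatures, the bucket labels, and the `threshold` default are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def partition(
    data: List[Pair],
    needs_knowledge: Callable[[str], bool],  # stage 1, helper model G
    gen_knowledge: Callable[[str], str],     # stage 2, helper model G
    make_exam: Callable[[str], list],        # stage 3, helper model G
    exam_score: Callable[[list], float],     # stage 4, foundation model M
    threshold: float = 0.67,                 # assumed pass mark, not from the paper
) -> Dict[str, List[Pair]]:
    """Split alignment data by whether M passes an exam on the required knowledge."""
    buckets: Dict[str, List[Pair]] = {
        "no_knowledge": [],
        "consistent": [],
        "inconsistent": [],
    }
    for instruction, response in data:
        if not needs_knowledge(instruction):
            buckets["no_knowledge"].append((instruction, response))
            continue
        knowledge = gen_knowledge(instruction)
        exam = make_exam(knowledge)
        key = "consistent" if exam_score(exam) >= threshold else "inconsistent"
        buckets[key].append((instruction, response))
    return buckets
```

The fifth stage, calibration, would then operate only on the `inconsistent` bucket.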

System Modules

Knowledge Requirement Classifier (Detection)

Determine if an instruction requires external knowledge or is just a rewriting/logic task

Model or implementation: GPT-3.5-turbo (Helper Model G)

Knowledge Generator (Detection)

Generate a reference knowledge snippet required to answer the instruction

Model or implementation: GPT-3.5-turbo (Helper Model G)

Exam Formulator (Detection)

Create a multiple-choice quiz based on the generated knowledge snippet to test the foundation model

Model or implementation: GPT-3.5-turbo (Helper Model G)

Student Evaluator (Detection)

Take the generated exam to check whether the model already knows the required information

Model or implementation: Foundation Model M (e.g., Llama-2-7B)
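
One way to score the Student Evaluator step is plain multiple-choice accuracy. The sketch below assumes each exam item carries a gold option letter and that the foundation model's raw output is parsed for the first standalone letter A-D; the `Question` type and the parsing rule are illustrative assumptions, not the paper's procedure.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    stem: str
    options: List[str]  # e.g. ["A. ...", "B. ...", ...]
    answer: str         # gold letter, e.g. "B"


def parse_choice(model_output: str) -> Optional[str]:
    """Extract the first standalone option letter (A-D) from raw model text."""
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match.group(1) if match else None


def exam_accuracy(questions: List[Question], outputs: List[str]) -> float:
    """Fraction of questions where the parsed choice matches the gold letter."""
    if not questions:
        return 0.0
    correct = sum(parse_choice(o) == q.answer for q, o in zip(questions, outputs))
    return correct / len(questions)
```

A naive parser like this misreads outputs that begin with the article "A", so a real pipeline would likely constrain the output format or compare option likelihoods instead.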

Novel Architectural Elements

Self-assessment loop via exam generation: Using a teacher model to generate exams *about* the training data content to verify the student model's prerequisite knowledge before training

Modeling

Base Model: Llama-2 (7B, 13B), Mistral-7B, Pythia-6.9B

Training Method: Supervised Fine-Tuning (SFT) on calibrated data

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Minimize negative log-likelihood of the target tokens given the input.
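
Written out, the objective is the usual token-level negative log-likelihood. The symbols are introduced here for illustration only: x is the (calibrated) instruction, y = (y_1, ..., y_T) the target response, θ the parameters of M, and D' the calibrated alignment dataset.

```latex
\mathcal{L}(\theta) = -\sum_{(x, y) \in D'} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Implementations typically average this sum over target tokens; the summary does not say which normalization the paper uses.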

Adaptation: Full fine-tuning

Training Data:

Alpaca (52k instructions) used as the seed dataset
Split into D_known, D_unknown (no knowledge needed), and D_inconsistent

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 3
max_length: 2048
warmup_ratio: 0.03
weight_decay: 0.0
lr_scheduler: cosine
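
To make the schedule concrete, the sketch below computes the learning rate at a given optimizer step from the listed values. It assumes linear warmup followed by cosine decay to zero, which is a common reading of `lr_scheduler: cosine` combined with a `warmup_ratio`; the summary does not specify the exact shape, and `total_steps` is supplied by the caller.

```python
import math

PEAK_LR = 2e-5       # learning_rate from the list above
WARMUP_RATIO = 0.03  # warmup_ratio from the list above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step` (0-indexed) under linear warmup + cosine decay."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With this shape, the rate peaks once 3% of the steps have elapsed and reaches half the peak midway through the decay phase.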

Compute: 8 NVIDIA A800 GPUs for training

Comparison to Prior Work

vs. Yang et al. (2023) / Zhang et al. (2023a): KCA handles complex tasks beyond simple QA by generating exams, whereas prior work relies on simple binary correctness of model answers.
vs. Retrieval-Augmented Generation (RAG): KCA is a *training-time* data curation intervention, not an inference-time retrieval mechanism (though Open-book tuning resembles 'training with retrieval').
vs. Rainy Day [not cited in paper]: Similar goal of identifying unknown knowledge, but KCA uses generated exams rather than just probing the model directly.

Limitations

Relies on a powerful aligned LLM (GPT-3.5) to generate knowledge and exams, which may have its own hallucinations or biases.
The process adds significant computational overhead to data preprocessing (generating exams and running inference on them).
Threshold for 'inconsistency' (exam score) is a hyperparameter that may need tuning per model/dataset.

Reproducibility

Code: https://github.com/fanqiwan/KCA

Code, model weights, and data are openly accessible at https://github.com/fanqiwan/KCA. The paper relies on GPT-3.5-turbo for data generation, which is a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Instruction tuning followed by evaluation on hallucination and helpfulness benchmarks

Benchmarks:

TruthfulQA (Truthful question answering)
FactScore (Factuality evaluation in biography generation)
HaRa (Hallucination Rate evaluation, various tasks)
QAMPARI (Retrieval-augmented generation QA)
SelfCheckGPT (Hallucination detection)
MIMIC-CXR (Clinical report generation)

Metrics:

MC1 (TruthfulQA)
Truthfulness (LLM-judge)
Helpfulness (LLM-judge)
Hallucination Rate
Statistical methodology: Not explicitly reported in the paper
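
For reference, TruthfulQA's MC1 is single-true-answer multiple-choice accuracy: a question counts as correct when the model assigns its highest score to the one true choice. The sketch below assumes per-choice log-probabilities have already been computed by the model under evaluation; the function name and input layout are illustrative, and whether the paper follows exactly this protocol is an assumption.

```python
from typing import Sequence


def mc1(log_probs: Sequence[Sequence[float]], correct_idx: Sequence[int]) -> float:
    """Fraction of questions whose top-scoring choice is the single true answer."""
    hits = 0
    for scores, gold in zip(log_probs, correct_idx):
        # Index of the answer choice with the highest log-probability.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += predicted == gold
    return hits / len(log_probs)
```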

Key Results

KCA variants consistently improve truthfulness (MC1) on TruthfulQA across different model backbones compared to the Vanilla SFT baseline.
Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	MC1	27.6	32.4	+4.8
TruthfulQA	MC1	31.9	39.5	+7.6
FactScore	FactScore %	44.6	50.5	+5.9
HaRa	Hallucination Rate (lower is better)	39.7	32.0	-7.7
QAMPARI (RAG)	Recall-5	38.5	46.2	+7.7

Helpfulness evaluation (using GPT-4 judge) shows that KCA strategies maintain competitive helpfulness while reducing hallucinations.
Benchmark	Metric	Baseline	This Paper	Δ
AlpacaEval	Win Rate vs Davinci003	78.4	78.8	+0.4

Experiment Figures

Correlation analysis between Knowledge Inconsistency Percentage and Hallucination Rate across different models.

Win-rate analysis of different KCA strategies against the baseline on Helpfulness vs. Hallucination mitigation.

Main Takeaways

Mitigating knowledge inconsistency significantly reduces hallucinations across diverse tasks.
Refusal Tuning is the most effective strategy for pure hallucination reduction but may make the model overly conservative.
Open-book Tuning is best for maintaining helpfulness while still improving factuality, particularly useful for RAG tasks.
Discard Tuning offers a middle ground but reduces dataset size, which might hurt diversity.
The method scales effectively across different model sizes (7B to 13B) and architectures (Llama, Mistral, Pythia).

📚 Prerequisite Knowledge

Prerequisites

Instruction tuning / Alignment (SFT)
Hallucination in LLMs
In-context learning (ICL)
Chain-of-Thought (CoT)

Key Terms

Knowledge Inconsistency: The discrepancy between the external knowledge required by alignment data and the intrinsic knowledge encoded in a foundation model during pretraining

KCA: Knowledge Consistent Alignment—the proposed framework to detect and fix knowledge inconsistencies before fine-tuning

Open-book Tuning: A calibration strategy where the missing reference knowledge is appended to the instruction context during fine-tuning

Refusal Tuning: A calibration strategy where the target response is modified to explicitly refuse answering due to lack of knowledge

Discard Tuning: A calibration strategy where data instances causing knowledge inconsistency are simply removed from the training set

ICL: In-Context Learning—prompting a model with examples to perform a task without updating weights

CoT: Chain-of-Thought—prompting a model to generate intermediate reasoning steps

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs