Alleviating Hallucinations of Large Language Models through Induced Hallucinations

📝 Paper Summary

Hallucination mitigation, Factuality in LLMs, Decoding strategies

Induce-then-Contrast Decoding (ICD) reduces LLM hallucinations by first creating a factually weak model that is prone to fabrication, then decoding by penalizing this weak model's predictions.

Core Problem

Large Language Models frequently generate hallucinations (inaccurate or fabricated information), and existing mitigation methods like supervised fine-tuning (SFT) or pre-training modifications are computationally expensive or may inadvertently encourage hallucination.

Why it matters:

Hallucinations hinder the practical application of LLMs in real-world scenarios where factual accuracy is critical
Standard SFT can teach models to answer beyond their knowledge boundaries (a behavior-cloning effect), which worsens hallucination
Directly modifying pre-training objectives is costly and may compromise generalization abilities

Concrete Example: When asked for a biography of 'Vasily Chuikov', a standard model might incorrectly state he died in 1967 (actual death: 1982). A model fine-tuned on factual data might still hallucinate due to behavior cloning, whereas ICD penalizes the likelihood of such common fabrications.

Key Novelty

Induce-then-Contrast Decoding (ICD)

Deliberately constructs a 'factually weak' model by fine-tuning the base LLM on non-factual (hallucinated) samples generated by ChatGPT.
Uses this weak model as a negative constraint during inference: the final token probability amplifies the base model's prediction while subtracting the weak model's probability.
Introduces an 'adaptive plausibility constraint' to ensure only plausible tokens are penalized, preventing degradation of grammar and fluency.

Architecture

Illustration of the Induce-then-Contrast Decoding (ICD) method compared to standard Next Token Prediction.

Evaluation Highlights

Improves TruthfulQA scores for Llama2-7B-Chat by +8.70 (MC1) and +14.18 (MC2), allowing it to outperform the much larger Llama2-70B-Chat.
Achieves a factual precision score of 66.3 on FActScore with Llama2-7B-Chat, surpassing the 70B counterpart (64.4) using greedy decoding.
Maintains general capability on benchmarks like MMLU (47.7 vs 47.7) and ARC (57.1 vs 56.7) while significantly reducing hallucinations.

Breakthrough Assessment

8/10

Offers a highly effective, compute-efficient inference-time solution that reverses the typical logic of SFT: a model is trained to be bad so that decoding knows what to avoid. With ICD, a 7B model shows significant gains and outperforms much larger models.

⚙️ Technical Details

Problem Definition

Setting: Auto-regressive next-token prediction with a focus on factual consistency

Inputs: System prompt s and user input u

Outputs: Target output o (a sequence of tokens)

Pipeline Flow

Hallucination Induction: Fine-tune base LLM on non-factual data to create weak model
Contrastive Inference: Run both Base LLM and Weak LLM on input
Logit Subtraction: Calculate Final Logits = Base Logits - Beta * Weak Logits
Adaptive Masking: Mask tokens with low base probability
Final Prediction: Select token from modified distribution
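
The Logit Subtraction and Adaptive Masking steps above can be sketched for a single decoding step as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name icd_step, the toy 4-token vocabulary, and the values beta=1.0 and alpha=0.1 are hypothetical; scores are computed on log-softmax outputs; and the masking rule (keep tokens whose base probability is at least alpha times the maximum base probability) is borrowed from the standard Contrastive Decoding formulation, since this summary only states that tokens with low base probability are masked.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def icd_step(base_logits, weak_logits, beta=1.0, alpha=0.1):
    """One greedy ICD step (hypothetical helper; alpha must be in (0, 1])."""
    base_logp = log_softmax(np.asarray(base_logits, dtype=float))
    weak_logp = log_softmax(np.asarray(weak_logits, dtype=float))
    # Logit subtraction: base scores minus beta times weak-model scores.
    scores = base_logp - beta * weak_logp
    # Adaptive masking (assumed form): a token stays eligible only if
    # p_base(token) >= alpha * max p_base.
    plausible = base_logp >= np.log(alpha) + base_logp.max()
    scores = np.where(plausible, scores, -np.inf)
    # Final prediction: greedy choice from the modified scores.
    return int(np.argmax(scores)), scores

# Toy 4-token vocabulary (made-up numbers, for illustration only).
base_logits = np.array([1.5, 2.0, -3.0, 0.0])
weak_logits = np.array([0.5, 3.0, -10.0, 0.0])

token, scores = icd_step(base_logits, weak_logits, beta=1.0, alpha=0.1)
print("base greedy token:", int(np.argmax(base_logits)))
print("ICD token:", token)
```

In this toy example the base model alone would pick token 1, which the weak model also strongly favors; the contrast shifts the choice to token 0. With the mask almost disabled (alpha=1e-9), the implausible token 2 would win because the weak model assigns it a tiny probability, which is the failure the plausibility constraint is meant to prevent.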

System Modules

Hallucination Inducer

Create a 'weak' model prone to hallucination

Model or implementation: Llama-2-7B (Base)

Base Decoder (Inference)

Generate standard next-token probabilities

Model or implementation: Llama-2-7B-Chat / Mistral-7B-Instruct

Weak Decoder (Inference)

Generate hallucinated next-token probabilities for penalty

Model or implementation: Fine-tuned Weak Model

Contrastive Mixer (Inference)

Combine logits and apply adaptive masking

Model or implementation: Mathematical Operation

Novel Architectural Elements

Induce-then-Contrast workflow: Explicitly training a model to fail (hallucinate) to serve as a precise negative constraint
Adaptive Plausibility Constraint: Dynamic filtering of the contrastive penalty based on the base model's confidence to protect fluency

Modeling

Base Model: Llama-2-7B-Chat, Llama-2-13B-Chat, Llama-2-70B-Chat, Baichuan2-7B/13B-Chat, Mistral-7B-Instruct

Training Method: Supervised Fine-Tuning (SFT) for the 'weak' model only

Objective Functions:

Purpose: Induce hallucinations in the weak model.

Formally: Maximize the log probability log p(o | s, u), where o is a non-factual target output.
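
Written out per token, this objective takes the usual auto-regressive form. The expression below is a sketch using the standard factorization; the symbol \theta' for the weak model's parameters and the loss name \mathcal{L} are notational assumptions, not taken from this summary.

```latex
\mathcal{L}(\theta') \;=\; -\log p(o \mid s, u;\, \theta')
\;=\; -\sum_{t=1}^{|o|} \log p\left(o_t \mid s, u, o_{<t};\, \theta'\right)
```

Minimizing \mathcal{L}(\theta') is equivalent to maximizing the log probability of the non-factual target o given the system prompt s and user input u.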

Adaptation: Full fine-tuning (implied, as parameter scale is discussed)

Trainable Parameters: All parameters of the weak model copy

Training Data:

10k hallucinated QA pairs from HaluEval (for TruthfulQA)
3.5k hallucinated biographies generated by ChatGPT (for FActScore)

Key Hyperparameters:

beta (contrast strength): Not explicitly reported in the paper; only the general range (0, +infinity) is mentioned
alpha (plausibility constraint): Not explicitly reported in the paper; only the general range [0, 1] is mentioned
fine_tuning_epochs: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Requires maintaining two models in memory during inference (Base + Weak)
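
A rough back-of-the-envelope estimate shows what holding both models in memory implies. The sketch below makes several assumptions that are not in the paper summary: a nominal 7 billion parameters per model, 16-bit weights (2 bytes per parameter), and weights only, ignoring the KV cache and activations. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate weight memory in GB (assumes 1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Nominal 7B parameters for both the base and the weak model (assumption).
base_gb = weight_memory_gb(7e9)
weak_gb = weight_memory_gb(7e9)
total_gb = base_gb + weak_gb

print(f"base: {base_gb:.1f} GB, weak: {weak_gb:.1f} GB, total: {total_gb:.1f} GB")
```

Under these assumptions the weight footprint roughly doubles compared with decoding from the base model alone; the actual figure depends on precision, quantization, and runtime overhead.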

Comparison to Prior Work

vs. DoLa: ICD contrasts against a specifically induced hallucinating model rather than just early layers, providing a more targeted penalty for non-factual information.
vs. Vanilla CD: ICD uses a fine-tuned weak version of the *same* model architecture rather than a smaller amateur model, ensuring the contrast is focused on factuality rather than general language modeling capability.
vs. ITI: ICD operates at the decoding probability level rather than intervening on internal activations.
vs. Direct SFT (on factual data) [compared in the paper's experiments, though not listed as a formal baseline]: ICD avoids the behavior-cloning issue in which models learn to answer everything regardless of what they know; direct SFT was shown to increase hallucination rates in the paper's experiments.

Limitations

Incurs additional memory cost (requires loading two models) and latency during decoding.
Fine-tuning based induction requires a training phase and synthetic data generation.
Summarization tasks showed less improvement compared to QA, likely due to different hallucination types (input-conflicting vs fact-conflicting).
Performance depends on the quality of the induced weak model; if the induction fails, the contrast may not be effective.

Reproducibility

Code: https://github.com/hillzhang1999/ICD

The repository provides both code and data. The exact hyperparameters (alpha, beta) used for the reported results are not listed in the main text. The process for generating the fine-tuning data (ChatGPT prompts) is described.

📊 Experiments & Results

Evaluation Setup

Discrimination-based and Generation-based hallucination evaluation

Benchmarks:

TruthfulQA (Multiple-choice QA)
FActScore (Biography Generation)
MMLU (General Knowledge)
ARC (Reasoning)
AlpacaEval 2.0 (Instruction Following)

Metrics:

MC1 (TruthfulQA)
MC2 (TruthfulQA)
MC3 (TruthfulQA)
Factual Precision Score (FActScore)
% Response (FActScore)
# Facts (FActScore)
Statistical methodology: Not explicitly reported in the paper
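
The three TruthfulQA multiple-choice metrics are not defined in this summary. The sketch below follows the way they are commonly computed for a single question from per-answer log-probabilities: MC1 checks whether the best true answer outscores every false answer, MC2 is the share of probability mass on true answers, and MC3 is the fraction of true answers that outscore every false answer. The function name truthfulqa_mc and the example scores are hypothetical, and the paper's evaluation code may differ in details.

```python
import numpy as np

def truthfulqa_mc(true_logps, false_logps, best_idx=0):
    """MC1/MC2/MC3 for one question (hypothetical helper).

    true_logps / false_logps: model log-probabilities of each true / false
    reference answer; best_idx: position of the best answer in true_logps.
    """
    true_logps = np.asarray(true_logps, dtype=float)
    false_logps = np.asarray(false_logps, dtype=float)
    max_false = false_logps.max()

    # MC1: is the best true answer scored above every false answer?
    mc1 = float(true_logps[best_idx] > max_false)

    # MC2: probability mass on true answers, normalized over all answers.
    p_true, p_false = np.exp(true_logps), np.exp(false_logps)
    mc2 = float(p_true.sum() / (p_true.sum() + p_false.sum()))

    # MC3: fraction of true answers scored above every false answer.
    mc3 = float(np.mean(true_logps > max_false))
    return mc1, mc2, mc3

# Made-up log-probabilities for one question, for illustration only.
mc1, mc2, mc3 = truthfulqa_mc(true_logps=[-1.0, -3.0], false_logps=[-2.0, -4.0])
print(mc1, round(mc2, 4), mc3)
```

Benchmark-level scores are then averages of these per-question values, usually reported as percentages.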

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on TruthfulQA showing ICD significantly improves truthfulness scores compared to baselines.
TruthfulQA	MC1	37.62	46.32	+8.70
TruthfulQA	MC2	54.60	68.78	+14.18
TruthfulQA	MC2	63.95	68.78	+4.83
Results on FActScore (Biography Generation) demonstrating improved factual precision.
FActScore	Factual Precision (%)	63.8	66.3	+2.5
FActScore	Factual Precision (%)	64.4	66.3	+1.9
Comparison against Direct Fine-tuning (SFT) on factual data, showing that direct SFT degrades performance.
FActScore	Factual Precision (%)	63.8	28.7	-35.1
FActScore	% Response	37.5	99.5	+62.0

Experiment Figures

Bar chart comparing MC2 scores on TruthfulQA for various models and methods (Llama2, Mistral, ChatGPT, GPT-4).

Pair-wise automatic evaluation (win-rate) comparing ICD to Greedy decoding on three dimensions: Factuality, Grammaticality, and Topicality.

Main Takeaways

ICD significantly improves factuality on both discrimination (TruthfulQA) and generation (FActScore) tasks, allowing 7B models to rival or beat 70B models.
Directly fine-tuning on factual data (SFT) can catastrophically degrade factual precision (-35.1 points, from 63.8 to 28.7) due to behavior cloning, where the model learns to answer every query regardless of its knowledge boundaries.
The 'Weak-to-Strong' generalization concept works for hallucination: a purposefully weak model can be used to elicit stronger factuality from the base model.
ICD generalizes across model families (Llama2, Baichuan2, Mistral) and sizes, with larger improvements often seen in stronger base models (e.g., Mistral).

📚 Prerequisite Knowledge

Prerequisites

Understanding of auto-regressive decoding (next-token prediction)
Familiarity with Contrastive Decoding (CD)
Basic knowledge of Supervised Fine-Tuning (SFT)

Key Terms

Hallucination: The generation of content that contradicts user input, context, or established real-world facts

Contrastive Decoding: A decoding strategy that determines next-token probabilities by contrasting the logits of a strong model against a weak (or amateur) model

SFT: Supervised Fine-Tuning—training a pre-trained model on specific input-output pairs to adapt it for a downstream task

FActScore: An evaluation benchmark that breaks generated text (like biographies) into atomic facts and verifies them against a knowledge source (e.g., Wikipedia)

TruthfulQA: A benchmark designed to measure whether language models generate truthful answers to questions

Logits: The raw, unnormalized scores output by the last layer of a neural network before applying softmax

Adaptive Plausibility Constraint: A filtering mechanism in ICD that only applies the contrastive penalty to tokens that have a sufficiently high probability in the base model, preserving fluency

Behavior Cloning: A phenomenon where a model learns to mimic the surface form of the training data (e.g., answering every question) without learning the underlying logic or truthfulness