DHI: Leveraging Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models

📝 Paper Summary

Hallucination suppression, Contrastive Decoding

DHI mitigates hallucinations by training an 'Evil Model' to generate diverse errors via a reversed loss on correct tokens, then subtracting its weighted logits from a base model's logits at decoding time, restricted to tokens the base model rates with high confidence.

Core Problem

Existing methods like ICD (Induce-then-Contrast Decoding) train 'Evil' models on narrow, pre-defined hallucination types, limiting their ability to catch diverse errors and requiring expensive annotated datasets.

Why it matters:

Models trained on specific error patterns struggle to generalize to unseen hallucination types.
Inducing hallucinations at specific positions can disrupt the coherence of subsequent token generation, degrading overall text quality.
Dependence on curated hallucination datasets limits scalability and applicability to new domains.

Concrete Example: If a user asks 'Who won the 2020 election?', a standard 'Evil' model trained only to swap names might say 'Trump', but miss other error types like fabricating a date. DHI's Evil Model is discouraged from saying 'Biden' entirely, forcing it to explore various wrong paths (names, dates, entities), providing a richer signal for correction.

Key Novelty

Diverse Hallucination Induction (DHI)

Instead of teaching the Evil Model specific wrong answers, DHI penalizes the *correct* answer in the loss function, forcing the model to generate *anything but* the truth.
Modifies the attention mask during Evil Model training so that the artificially induced errors don't confuse the model when generating subsequent tokens, preserving fluency.
Restricts the contrastive penalty at inference to candidate tokens that the base (Positive) model already rates with high confidence (the adaptive rationality constraint), which avoids unnecessary penalties on factually correct tokens.

Architecture

Conceptual comparison of standard hallucination induction (ICD) vs. Diverse Hallucination Induction (DHI).

Evaluation Highlights

Achieves highest average score of 53.2 on TruthfulQA, outperforming ICD (50.5) and CD (42.2).
+3.7 point improvement on TruthfulQA MC3 over the strongest baseline, ICD (41.3 to 45.0).
Achieves 68.1 Factual Precision Score on FactScore (biography generation), surpassing ICD by 1.8 points.

Breakthrough Assessment

7/10

Offers a clever inversion of the training objective (penalizing truth rather than rewarding specific lies) to improve generalization. Strong empirical results, though primarily an evolution of the existing ICD framework.

⚙️ Technical Details

Problem Definition

Setting: Mitigating hallucinations in open-ended generation and QA by contrasting a base model with a fine-tuned negative expert.

Inputs: Input context X (e.g., background knowledge and question).

Outputs: Token sequence Y with maximized factuality.

Pipeline Flow

Input Processing: Feed input X to Positive Model and Evil Model
Logit Generation: Compute logits from both models
Adaptive Selection: Identify valid tokens where Positive Model has high confidence
Contrastive Adjustment: Subtract weighted Evil logits from Positive logits for valid tokens
Sampling: Select next token from adjusted distribution
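
The pipeline steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the threshold `tau`, and the toy vocabulary and logits are hypothetical. The validity test borrows the plausibility rule from standard contrastive decoding (keep tokens whose Positive Model probability is at least `tau` times the maximum), and masking out the remaining tokens is likewise an assumption. Greedy selection stands in for the sampling step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_contrastive_step(pos_logits, evil_logits, beta=1.0, tau=0.1):
    """Return adjusted scores for one decoding step.

    Assumption: a token is 'valid' when its Positive Model probability is
    at least tau * (max probability). Invalid tokens are masked to -inf.
    """
    pos_probs = softmax(pos_logits)
    cutoff = tau * max(pos_probs)
    adjusted = []
    for p, z_pos, z_evil in zip(pos_probs, pos_logits, evil_logits):
        if p >= cutoff:
            # Contrastive adjustment: subtract the weighted Evil logit.
            adjusted.append(z_pos - beta * z_evil)
        else:
            adjusted.append(float("-inf"))
    return adjusted

def greedy_pick(scores):
    """Index of the highest adjusted score (greedy stand-in for sampling)."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy step with a four-token vocabulary.
vocab = ["Biden", "Trump", "2016", "banana"]
pos_logits = [2.0, 1.8, 0.5, -4.0]
evil_logits = [-1.0, 2.5, 1.0, -6.0]

adjusted = adaptive_contrastive_step(pos_logits, evil_logits, beta=1.0, tau=0.1)
choice = vocab[greedy_pick(adjusted)]
```

In this toy case the Positive Model barely separates "Biden" from "Trump", but the Evil Model strongly prefers "Trump", so the adjustment widens the gap. The implausible token "banana" would otherwise receive a high contrast score simply because both models dislike it; the confidence filter removes it.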

System Modules

Positive Model

Generate standard next-token predictions based on learned factual knowledge

Model or implementation: Llama-2-7B-Chat (or 70B for comparisons)

Evil Model

Generate potential hallucinations to be subtracted/penalized

Model or implementation: Llama-2-7B-Base (fine-tuned with DHI objective)

Adaptive Contrastive Decoder

Combine logits using adaptive rationality constraint

Model or implementation: Algorithm (Inference logic)

Novel Architectural Elements

Modified causal attention mask during Evil Model training (removes influence of hallucination-targeted positions on future tokens)
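
One way to picture this mask change is the sketch below. It is an assumption-laden illustration, since the summary does not give the exact masking rule: here every later position is simply blocked from attending to a hallucination-targeted position, while the targeted position still attends to itself and its own past. The function names and the example positions are hypothetical.

```python
def causal_mask(n):
    """Standard causal mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_target_mask(n, target_positions):
    """Causal mask in which later tokens cannot attend to targeted positions.

    Assumption: only the columns of targeted positions are hidden from
    strictly later rows; the diagonal is left untouched.
    """
    mask = causal_mask(n)
    for i in range(n):
        for j in set(target_positions):
            if j < i:
                mask[i][j] = False
    return mask

# Hypothetical 5-token sequence where position 2 holds the factual token.
mask = masked_target_mask(5, target_positions=[2])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Rows 3 and 4 lose their entry in column 2, so whatever the Evil Model is pushed to emit at the targeted position does not feed into the representation of subsequent tokens.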

Modeling

Base Model: Llama-2-7B-Base (for Evil Model training) / Llama-2-7B-Chat (as Positive Model)

Training Method: Supervised Fine-Tuning (SFT) with modified loss and attention

Objective Functions:

Purpose: Discourage the Evil Model from generating the correct answer while maintaining fluency.

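Formally: $L = -\sum_{t \notin T} \log P(y_t \mid x, y_{<t}) + \alpha \sum_{t \in T} \log P(y_t \mid x, y_{<t})$, where $T$ is the set of target factual token positions and $\alpha$ is the training penalty strength. The paper states that the 'loss associated with generating correct tokens is assigned a negative value'; the positive sign on the second term expresses this, so minimizing $L$ lowers the probability of the correct tokens while ordinary tokens are still trained with the usual negative log-likelihood.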
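
A small numeric sketch of this objective follows. It assumes a plain sum over tokens (the summary does not say whether the paper sums or averages), and the function name, probabilities, and target flags are hypothetical.

```python
import math

def dhi_loss(token_probs, is_target, alpha=0.1):
    """Reversed-loss objective over one sequence.

    token_probs: model probability of each reference token.
    is_target:   True where the token is a target factual token.
    Non-target tokens contribute -log p (standard NLL).
    Target tokens contribute +alpha * log p (a negative-valued loss),
    so lowering p on those tokens lowers the total loss.
    """
    loss = 0.0
    for p, target in zip(token_probs, is_target):
        if target:
            loss += alpha * math.log(p)
        else:
            loss += -math.log(p)
    return loss

# Hypothetical two-token sequence: one ordinary token, one factual token.
flags = [False, True]
confident_truth = dhi_loss([0.5, 0.25], flags, alpha=0.1)
suppressed_truth = dhi_loss([0.5, 0.05], flags, alpha=0.1)
no_penalty = dhi_loss([0.5, 0.25], flags, alpha=0.0)
```

Here `suppressed_truth` is smaller than `confident_truth`, which is the intended pressure away from the correct token. With `alpha=0.0` the factual token drops out of the loss entirely and only the ordinary token is trained.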

Adaptation: LoRA

Training Data:

Derived from HaluEval QA subset
Expanded short answers into full declarative sentences using GPT-4o
9,196 training samples after filtering

Key Hyperparameters:

alpha (training penalty strength): 0.0 (TruthfulQA), 0.1 (FactScore)
beta (inference contrast weight): 1.0 (TruthfulQA), 2.0 (FactScore)
LoRA settings: Not explicitly detailed beyond 'utilizing LoRA'

Compute: Not reported in the paper

Comparison to Prior Work

vs. ICD: DHI penalizes correct tokens generally rather than rewarding specific wrong tokens, and uses a modified causal attention mask during training to preserve coherence.
vs. CD: DHI contrasts specifically induced hallucination tendencies rather than generic model size differences.
vs. DoLa: DHI uses a separate fine-tuned negative expert rather than internal layer contrasting.

Limitations

Relies on the availability of a base dataset (HaluEval) to identify 'correct' answers to penalize.
Requires fine-tuning an auxiliary model (Evil Model), which adds computational overhead compared to training-free methods like DoLa.
Performance depends on hyperparameters alpha and beta which vary by task.

Reproducibility

Training data derivation logic (HaluEval + GPT-4o expansion) is described. Code URL not provided. Hyperparameters alpha and beta are provided for specific benchmarks.

📊 Experiments & Results

Evaluation Setup

Multiple-choice QA and open-ended biography generation.

Benchmarks:

TruthfulQA (Multiple-choice QA)
FACTSCORE (Long-form biography generation)

Metrics:

MC1
MC2
MC3
Factual Precision Score
Number of Facts

Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on TruthfulQA: DHI outperforms the baselines on every metric, with the largest gain on MC3.
Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	Average (MC1/2/3)	50.5 (ICD)	53.2	+2.7
TruthfulQA	MC1	40.5 (ICD)	41.9	+1.4
TruthfulQA	MC2	69.7 (ICD)	72.6	+2.9
TruthfulQA	MC3	41.3 (ICD)	45.0	+3.7
TruthfulQA	Average (MC1/2/3)	42.2 (CD)	53.2	+11.0

Evaluation on open-ended generation (FactScore): DHI improves factual precision without reducing the number of facts generated.
Benchmark	Metric	Baseline	This Paper	Δ
FACTSCORE	Factual Precision Score	66.3 (ICD)	68.1	+1.8
FACTSCORE	# Facts	46.5	49.9	+3.4
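
As a quick consistency check on the TruthfulQA rows above, the reported averages can be recomputed from the per-metric scores. The snippet assumes the average is the unweighted mean of MC1, MC2, and MC3, and reads the 50.5 baseline as ICD, as stated in the Evaluation Highlights; the variable names are illustrative.

```python
# Per-metric TruthfulQA scores copied from the results table.
dhi = {"MC1": 41.9, "MC2": 72.6, "MC3": 45.0}
icd = {"MC1": 40.5, "MC2": 69.7, "MC3": 41.3}

def average(scores):
    """Unweighted mean of the metric values (assumed averaging rule)."""
    return sum(scores.values()) / len(scores)

avg_dhi = round(average(dhi), 1)
avg_icd = round(average(icd), 1)
deltas = {name: round(dhi[name] - icd[name], 1) for name in dhi}

print(avg_dhi, avg_icd, deltas)
```

The recomputed means match the reported 53.2 and 50.5, and the per-metric gaps match the Δ column.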

Experiment Figures

Illustration of the Causal Attention Masking Adaptation.

Main Takeaways

Targeted fine-tuning (Evil Model) provides a stronger contrastive signal than simply contrasting against a smaller model (standard CD) or internal layers (DoLa).
Penalizing the correct answer (DHI) creates a more diverse and effective 'Evil' expert than training on specific hallucination types (ICD).
DHI is effective for both structured tasks (Multiple Choice) and free-form generation (Biographies).
A fine-tuned negative expert matters more than model size: CD baselines that use 70B models (42.2 avg) score below DHI with 7B models (53.2 avg) on TruthfulQA.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Decoding
Causal Language Modeling
LoRA (Low-Rank Adaptation)

Key Terms

Hallucination: Inaccurate or fabricated information generated by LLMs that contradicts input, context, or established facts.

ICD: Induce-then-Contrast Decoding—a method that trains a model to hallucinate and uses it to penalize similar outputs in a base model.

Positive Model: The unmodified LLM (here Llama-2-7B-Chat) whose logits serve as the reference for factual generation and are corrected during decoding.

Evil Model: A version of the LLM fine-tuned to deliberately generate hallucinations or avoid correct answers.

Contrastive Decoding: A decoding strategy that subtracts the logits of a weaker or negative model from a strong model to highlight high-quality tokens.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.

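MC1/MC2/MC3: Multiple-choice metrics for TruthfulQA. MC1: accuracy, counting a question as correct when the single best answer receives the highest score among all choices. MC2: normalized probability mass assigned to the set of true answers. MC3: fraction of true answers that score higher than every false answer.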

FactScore: A benchmark measuring factual precision in long-form text generation by breaking responses into atomic facts and verifying them.
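
To make the MC1/MC2/MC3 definitions above concrete, here is a sketch for a single question. It follows a common way of computing these metrics from per-answer log-probability scores and is an assumption about the evaluation code, not a quotation of it; the function name and the example scores are hypothetical.

```python
import math

def mc_metrics(true_scores, false_scores, best_index=0):
    """Compute (MC1, MC2, MC3) for one question.

    true_scores / false_scores: log-probability scores the model assigns
    to each true and each false reference answer.
    best_index: position of the designated best answer in true_scores.
    """
    max_false = max(false_scores)
    # MC1: the best answer must beat every false answer.
    mc1 = 1.0 if true_scores[best_index] > max_false else 0.0
    # MC2: share of total probability mass that falls on true answers.
    mass_true = sum(math.exp(s) for s in true_scores)
    mass_false = sum(math.exp(s) for s in false_scores)
    mc2 = mass_true / (mass_true + mass_false)
    # MC3: fraction of true answers that beat every false answer.
    mc3 = sum(s > max_false for s in true_scores) / len(true_scores)
    return mc1, mc2, mc3

# Hypothetical question with two true and two false answers.
mc1, mc2, mc3 = mc_metrics(true_scores=[-1.0, -3.0], false_scores=[-2.0, -4.0])
```

In this example the best answer outranks both false answers, so MC1 is 1.0, while only one of the two true answers does, so MC3 is 0.5. Benchmark-level scores are the averages of these per-question values.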