Zhejiang Laboratory,
The Chinese University of Hong Kong,
Zhejiang University
arXiv
(2024)
Factuality, Benchmark
📝 Paper Summary
Hallucination suppression, Representation editing
Iter-AHMCL reduces hallucinations by fine-tuning LLMs with contrastive learning guided by dedicated positive (truthful) and negative (hallucinating) proxy models that are iteratively updated.
Core Problem
LLMs suffer from hallucinations where they fabricate facts, and standard fine-tuning methods to fix this often cause catastrophic forgetting of general capabilities.
Why it matters:
Factual inaccuracies in high-stakes fields like scientific research or medicine can degrade application quality and erode user trust
Existing sample-level guidance methods (using static vectors) are prone to overfitting and depend heavily on hyperparameter tuning
Alignment techniques often trade off hallucination reduction against the model's original language modeling strengths
Concrete Example: When asked a question that induces hallucination, a standard model might confidently fabricate an answer. Existing editing methods like LoRRA use a fixed vector direction calculated from data samples, which might be imprecise for unseen inputs. Iter-AHMCL instead uses a trained 'negative model' that specifically tries to hallucinate, giving the main model a dynamic negative reference to move away from.
Replaces static vector guidance with dynamic model-level guidance: trains separate 'positive' (truthful) and 'negative' (hallucinatory) LoRA adapters to generate reference representations
Uses these reference models to define a contrastive loss that pulls the target model's representations toward the positive model and pushes them away from the negative model
Updates the guidance models iteratively: as the main model improves, the positive guidance model is updated to be even more truthful, creating a moving target for continuous improvement
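The pull/push idea in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes representations are plain vectors, uses squared Euclidean distance as the distance measure, and the names `squared_distance`, `contrastive_guidance_loss`, `alpha`, and `beta` are hypothetical.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_guidance_loss(
    target: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    alpha: float = 1.0,  # assumed weight on the pull term
    beta: float = 1.0,   # assumed weight on the push term
) -> float:
    """Lower is better: close to the positive guide, far from the negative one."""
    pull = squared_distance(target, positive)  # shrink this
    push = squared_distance(target, negative)  # grow this
    return alpha * pull - beta * push


if __name__ == "__main__":
    h_pos = [1.0, 0.0]   # toy representation from the positive guide
    h_neg = [-1.0, 0.0]  # toy representation from the negative guide
    # A target near the positive guide should score lower than one near the negative guide.
    print(contrastive_guidance_loss([0.8, 0.1], h_pos, h_neg))
    print(contrastive_guidance_loss([-0.8, 0.1], h_pos, h_neg))
```

In a real setup the three vectors would be hidden states produced by the target model and the two guidance models for the same prompt.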
Architecture
The overall procedure of Iter-AHMCL, illustrating the data construction, guidance model training, and the iterative fine-tuning loop.
Evaluation Highlights
+10.1 point average improvement on TruthfulQA benchmark across four foundation models (LLaMA2, Alpaca, LLaMA3, Qwen)
Effectively reduces hallucination while maintaining general capabilities, verified through comprehensive experiments on multiple LLMs
Demonstrates that model-level guidance outperforms sample-level vector guidance (standard LoRRA) in separating truthful and untruthful directions
Breakthrough Assessment
7/10
Offers a strong methodological improvement over static representation editing (LoRRA) by using dynamic proxy models. The iterative update strategy is a logical and effective extension for alignment.
⚙️ Technical Details
Problem Definition
Setting: Fine-tuning a pre-trained LLM to reduce the likelihood of generating non-factual content while preserving general capabilities
Inputs: Instruction prompts prone to eliciting hallucinations (e.g., from PKU-SafeRLHF)
Outputs: Truthful responses that align with factual reality
Pipeline Flow
Data Preparation (Construct Positive/Negative/Neutral Triplets)
Guidance Model Pre-training (Train Positive and Negative LoRA Adapters)
Iterative Fine-tuning (Edit Target Model Representations using Guidance Models)
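The data preparation stage can be pictured as wrapping each instruction in a truthful and an untruthful template. The sketch below is illustrative only: the template wording is hypothetical (the paper's actual templates are not reproduced here), and treating the bare instruction as the neutral element of the triplet is an assumption.

```python
from dataclasses import dataclass

# Hypothetical template strings; the paper's exact wording is not given in this summary.
POSITIVE_TEMPLATE = "Answer truthfully and avoid unsupported claims.\n{instruction}"
NEGATIVE_TEMPLATE = "Answer with made-up, unverified claims.\n{instruction}"


@dataclass(frozen=True)
class PromptTriplet:
    neutral: str   # assumed to be the bare instruction
    positive: str  # instruction wrapped in the truthful template
    negative: str  # instruction wrapped in the untruthful template


def build_triplet(instruction: str) -> PromptTriplet:
    """Build one neutral/positive/negative prompt triplet from an instruction."""
    instruction = instruction.strip()
    return PromptTriplet(
        neutral=instruction,
        positive=POSITIVE_TEMPLATE.format(instruction=instruction),
        negative=NEGATIVE_TEMPLATE.format(instruction=instruction),
    )


if __name__ == "__main__":
    triplet = build_triplet("Who wrote Hamlet?")
    print(triplet.positive)
    print(triplet.negative)
```

The positive and negative prompts would then be used to train the respective LoRA guidance adapters in the second stage.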
System Modules
Positive Guidance Model (Guidance Generation)
Generates latent representations that are biased towards truthfulness
Model or implementation: Same architecture as base LLM + LoRA adapter
Negative Guidance Model (Guidance Generation)
Generates latent representations that are biased towards hallucination
Model or implementation: Same architecture as base LLM + LoRA adapter
Target Model (LLM)
The main model being fine-tuned to reduce hallucination
Model or implementation: Base LLM (e.g., LLaMA2, Qwen) being edited
Novel Architectural Elements
Dual-model guidance system: simultaneous use of a dedicated 'hallucinating' model and 'truthful' model to define the contrastive direction
Iterative update loop: After each round, the fine-tuned target model replaces the positive guidance model for the next round, progressively refining the 'truthful' direction
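The effect of the iterative update loop can be seen in a toy simulation. This is a sketch under strong assumptions, not the paper's training code: each "model" is reduced to a single representation vector, the loss is assumed to be `alpha * ||t - p||^2 - beta * ||t - n||^2`, the negative guide is kept fixed, and all names and default values (`fine_tune_step`, `run_rounds`, `lr`, and so on) are hypothetical.

```python
from typing import List

Vector = List[float]


def fine_tune_step(target: Vector, positive: Vector, negative: Vector,
                   alpha: float, beta: float, lr: float) -> Vector:
    """One gradient-descent step on alpha*||t-p||^2 - beta*||t-n||^2."""
    return [
        t - lr * (2 * alpha * (t - p) - 2 * beta * (t - n))
        for t, p, n in zip(target, positive, negative)
    ]


def run_rounds(target: Vector, positive: Vector, negative: Vector,
               rounds: int = 3, steps: int = 20,
               alpha: float = 1.0, beta: float = 0.2,
               lr: float = 0.05) -> List[Vector]:
    """Return the target representation at the end of each round."""
    history = []
    for _ in range(rounds):
        for _ in range(steps):
            target = fine_tune_step(target, positive, negative, alpha, beta, lr)
        history.append(list(target))
        # The tuned target becomes the positive guide for the next round.
        positive = list(target)
    return history


if __name__ == "__main__":
    for round_index, state in enumerate(run_rounds([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0])):
        print(round_index, state)
```

Because the positive reference is refreshed each round, the target keeps moving away from the negative guide instead of settling at the first reference point. In this toy nothing bounds that drift; a real system would rely on the preserved language modeling ability and a finite number of rounds.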
Modeling
Base Model: Evaluated on LLaMA2, Alpaca, LLaMA3, and Qwen
Training Method: Iterative Contrastive Representation Editing
Objective Functions:
Purpose: Minimize distance to positive model representations and maximize distance to negative model representations.
vs. LoRRA: Uses full guidance models (dynamic) instead of static sample vectors, allowing better generalization
vs. Standard Contrastive Learning: Operates at the model/representation level with iteratively updated guidance models
vs. RLHF [not cited in paper]: Focuses specifically on representation editing to preserve original capabilities rather than global policy optimization
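One plausible way to write the objective described above is shown below. The exact loss is not given in this summary, so the form is an assumption: $h_\theta$, $h^{+}$, and $h^{-}$ denote the representations of the target, positive, and negative models, $d$ is some distance (for example squared $\ell_2$), $\mathcal{D}$ is the set of training prompts, and $\alpha$, $\beta$ are the balancing weights mentioned under Limitations.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}
    \Big[
      \alpha \, d\big(h_\theta(x),\, h^{+}(x)\big)
      \;-\;
      \beta \, d\big(h_\theta(x),\, h^{-}(x)\big)
    \Big]
```

Minimizing the first term pulls the target toward the positive model, and the negated second term rewards distance from the negative model.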
Limitations
Depends on the quality of the PKU-SafeRLHF dataset for constructing guidance templates
Computational cost is higher than static editing because the guidance models must be kept in memory and run during training
Hyperparameters (alpha, beta) for balancing loss terms likely require tuning
Reproducibility
Code and models are stated to be released upon publication (not currently available). Dataset (PKU-SafeRLHF) is public. Detailed hyperparameters like learning rates or batch sizes are not explicitly listed in the text.
📊 Experiments & Results
Evaluation Setup
Fine-tuning foundation models on hallucination-related data and evaluating on standard truthfulness benchmarks
Benchmarks:
TruthfulQA (Hallucination/Factuality Evaluation)
Metrics:
TruthfulQA Score (MC1/MC2 or generative score not specified, likely MC given 'points')
General capability metrics (implied; specific metric names are not given in the summarized text)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
TruthfulQA | Score | Not reported in the paper | Not reported in the paper | +10.1
Experiment Figures
Illustrative example of data construction showing the Instruction, Positive Prompt (Truthful), and Negative Prompt (Untruthful) templates.
Main Takeaways
Iter-AHMCL achieves a 10.1-point average improvement on TruthfulQA across four LLMs (LLaMA2, Alpaca, LLaMA3, Qwen).
The iterative strategy combined with model-level contrastive learning is effective at reducing hallucinations.
The method preserves general capabilities while addressing hallucinations, avoiding the catastrophic forgetting often seen in standard alignment.
📚 Prerequisite Knowledge
Prerequisites
Contrastive Learning principles
Low-Rank Adaptation (LoRA) for LLMs
Representation Engineering/Editing in LLMs
Key Terms
Iter-AHMCL: Alleviating Hallucination via Iterative Model-level Contrastive Learning, the proposed method, which uses dynamic positive/negative models to guide representation editing
LoRRA: Low-Rank Representation Adaptation—a baseline method that uses static data differences to guide representation editing
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that injects trainable rank decomposition matrices into transformer layers
PKU-SafeRLHF: A dataset containing safe and unsafe (or truthful/hallucinated) response pairs used for alignment training
Contrastive Learning: A learning paradigm that encourages a model to pull representations of similar (positive) pairs together and push dissimilar (negative) pairs apart
Catastrophic Forgetting: The tendency of an artificial neural network to abruptly forget previously learned information upon learning new information
Guidance Model: A specialized version of the LLM trained specifically to exhibit either high truthfulness (positive) or high hallucination (negative) to serve as a reference
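To make the LoRA entry above concrete, the following sketch shows the standard low-rank update `y = W x + scale * B (A x)` using plain Python lists. The helper names `matvec` and `lora_forward` are illustrative, and how the guidance adapters are configured in the paper (rank, scaling, target layers) is not stated in this summary.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix,
                 scale: float = 1.0) -> Vector:
    """Frozen weight W (d_out x d_in) plus a trainable rank-r update B @ A.

    A has shape r x d_in and B has shape d_out x r, so only
    r * (d_in + d_out) parameters are trained instead of d_in * d_out.
    """
    base = matvec(W, x)                 # frozen pre-trained path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scale * d for b, d in zip(base, low_rank)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen identity weight
    A = [[1.0, 1.0]]              # rank r = 1
    B = [[0.0], [0.0]]            # zero B leaves the base model unchanged
    print(lora_forward([1.0, 2.0], W, A, B))
```

With `B` set to zeros the adapter is a no-op, which is why a freshly attached adapter starts from the base model's behavior. A positive or negative guidance model then differs from the base LLM only in its small `A` and `B` matrices.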