CDT reduces hallucinations by guiding language model decoding away from a fine-tuned hallucinatory comparator and towards a truthful one, using a mixture-of-experts strategy to handle diverse task patterns.
Core Problem
LLMs suffer from multifaceted hallucinations (both intrinsic and faithful) across different tasks, and existing methods often compromise internal knowledge or fail to address coupled hallucination patterns simultaneously.
Why it matters:
Hallucinations severely limit model credibility in realistic applications by generating plausible but non-factual claims.
Secondary training for factuality is labor-intensive and can inadvertently encourage spurious patterns.
Existing editing methods may interfere with internal factual knowledge, causing performance bottlenecks on out-of-distribution tasks.
Concrete Example: In text summarization, a model like LLaMA2-7B-Chat might exhibit both 'faithful hallucination' (wrongly extracting information) and 'intrinsic hallucination' (fabricating false content not in the document) in the same response, which simple penalty methods fail to distinguish.
Key Novelty
Comparator-driven Decoding-Time (CDT) framework
Constructs two distinct comparator models (hallucinatory and truthful) via multi-task fine-tuning to model negative and positive generation attributes separately.
Uses an Instruction Prototype-guided Mixture of Experts (PME) strategy within the comparators to dynamically activate different LoRA adapters based on the specific hallucination pattern of the current task.
Modifies the final output probability distribution during inference by contrasting the target model against both comparators, penalizing the hallucinatory distribution and boosting the truthful one.
Architecture
The overall CDT framework runs decoding in parallel: the Target LLM, the Truthful Comparator, and the Hallucinatory Comparator all process the same input. Each comparator contains an Instruction Prototype-guided Mixture of Experts (PME) module.
Evaluation Highlights
+5.49 improvement on ROUGE-L for LLaMA2-7B-Chat on the XSum summarization task compared to the base model.
Outperforms the strong baseline ICD by ~1-3 points across multiple metrics (ROUGE, BERTScore) on standard benchmarks.
Achieves consistent gains across four diverse tasks (TruthfulQA, XSum, WoW, HaluEval) using the same framework.
Breakthrough Assessment
7/10
Offers a robust decoding-time solution that handles multifaceted hallucinations via a novel mixture-of-experts approach. While effective, it relies on constructing specific comparator models, adding some complexity compared to simple penalty methods.
⚙️ Technical Details
Problem Definition
Setting: Autoregressive text generation under multi-task instructions
Inputs: Input instruction x containing query and context
Outputs: Generated response y consisting of a sequence of tokens
Identify the semantic cluster of the input instruction to guide expert selection.
Model or implementation: Gaussian Mixture Model (GMM) on instruction features
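The cluster-identification step can be sketched as a soft assignment of one instruction feature to mixture components. This is a minimal illustration, assuming isotropic Gaussian components with a shared variance and already-fitted means and mixing weights; the function name `gmm_responsibilities`, the toy 2-D features, and the three prototypes are hypothetical and not taken from the paper.

```python
import math

def gmm_responsibilities(feature, prototypes, weights, variance=1.0):
    """Posterior probability of each mixture component for one feature vector.

    Assumption: isotropic Gaussians sharing one variance, so the Gaussian
    normalisation constant is identical for all components and cancels.
    """
    log_scores = []
    for mean, weight in zip(prototypes, weights):
        sq_dist = sum((f - m) ** 2 for f, m in zip(feature, mean))
        log_scores.append(math.log(weight) - sq_dist / (2.0 * variance))
    # Normalise with the log-sum-exp trick for numerical stability.
    peak = max(log_scores)
    exps = [math.exp(s - peak) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D instruction features and three task prototypes.
prototypes = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]
weights = [1 / 3, 1 / 3, 1 / 3]
resp = gmm_responsibilities((4.5, 5.2), prototypes, weights)

# Index of the most likely cluster for this instruction.
cluster = max(range(len(resp)), key=resp.__getitem__)
```

The soft responsibilities in `resp` could serve directly as gating weights over experts, or `cluster` could be used for hard routing; which of the two the paper uses is not stated in this summary.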
Comparators (Hallucinatory & Truthful)
Generate logits representing specific hallucination or truthfulness patterns.
Model or implementation: Base LLM with PME-guided LoRA adapters
Logit Integration (decoding)
Combine logits from target LLM and comparators to shift distribution.
Model or implementation: Mathematical operation (Equation 2 in paper)
Adaptive Plausibility Constraint (decoding)
Truncate the vocabulary to retain only high-confidence tokens from the target model, preventing the comparators from enforcing implausible tokens.
Model or implementation: Thresholding operation
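The two decoding steps above can be sketched for a single next-token decision. Equation 2 of the paper is not reproduced in this summary, so the combination rule below is an assumption: the target log-probability is shifted by `alpha` times the difference between the truthful and hallucinatory log-probabilities. The plausibility cutoff follows the form commonly used in contrastive decoding (keep tokens whose target probability is at least `beta` times the maximum). The names `cdt_step`, `alpha`, and `beta` and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_z for l in logits]

def cdt_step(target_logits, truthful_logits, halluc_logits, alpha=1.0, beta=0.1):
    """Score each token for one decoding step (assumed form, not Equation 2)."""
    lp_target = log_softmax(target_logits)
    lp_truth = log_softmax(truthful_logits)
    lp_halluc = log_softmax(halluc_logits)
    # Adaptive plausibility constraint: p_target(token) >= beta * max p_target.
    cutoff = max(lp_target) + math.log(beta)
    scores = []
    for t, pos, neg in zip(lp_target, lp_truth, lp_halluc):
        if t < cutoff:
            scores.append(float("-inf"))  # pruned: implausible under the target
        else:
            scores.append(t + alpha * (pos - neg))  # boost truthful, penalise hallucinatory
    return scores

# Toy vocabulary of four tokens.
target = [2.0, 1.9, 0.0, -5.0]
truthful = [1.0, 2.5, 0.0, 3.0]
halluc = [2.5, 0.5, 0.0, -3.0]

scores = cdt_step(target, truthful, halluc)
next_token = max(range(len(scores)), key=scores.__getitem__)
greedy_token = max(range(len(target)), key=target.__getitem__)
```

In this toy case the target model alone would pick token 0, while the combined score picks token 1. Token 3 is strongly favoured by the comparators but is pruned because the target model assigns it negligible probability, which is the failure mode the plausibility constraint is meant to prevent.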
Novel Architectural Elements
Dual-comparator decoding: Simultaneously using a positive (truthful) and negative (hallucinatory) comparator.
Prototype-guided Mixture of Experts (PME) in LoRA adapters: Routing inputs to specific fine-tuned experts based on instruction clusters to handle multi-task hallucination patterns.
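A mixture of LoRA experts inside one linear layer can be sketched as the frozen base projection plus a gate-weighted sum of low-rank updates, y = W x + sum_k g_k B_k (A_k x). This is an illustrative assumption about how the experts are combined; the paper's exact routing and scaling are not given in this summary, and `pme_lora_forward`, the toy weights, and the gate values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def pme_lora_forward(x, base_weight, experts, gate):
    """Frozen base projection plus gate-weighted low-rank expert updates.

    experts: list of (A, B) pairs, with A of shape (r, d_in) and B of shape (d_out, r).
    gate:    one non-negative weight per expert, e.g. derived from instruction prototypes.
    """
    out = matvec(base_weight, x)
    for g, (a, b) in zip(gate, experts):
        delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
        out = [o + g * d for o, d in zip(out, delta)]
    return out

# Toy layer: hidden size 2, two rank-1 experts.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # expert 0 acts on the first coordinate
    ([[0.0, 1.0]], [[0.0], [2.0]]),  # expert 1 acts on the second coordinate
]
gate = [0.25, 0.75]

y = pme_lora_forward([2.0, 3.0], base_weight, experts, gate)
```

Because the base weight is untouched, setting every gate value to zero recovers the original layer output.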
Modeling
Base Model: LLaMA2-7B-Chat (and potentially others like Vicuna-7B, though LLaMA2 is the primary example)
Training Method: Supervised Fine-Tuning (SFT) with LoRA and Adversarial Training
Objective Functions:
Purpose: Optimize the hallucinatory comparator to mimic non-factual generation.
Formally: Standard Cross-Entropy Loss on hallucinated responses.
Purpose: Optimize the truthful comparator to resist hallucinations using adversarial perturbations.
Formally: Minimize loss on factual responses after adding gradient-based perturbations derived from hallucinated responses.
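The two objectives can be written out as follows. This is a hedged reconstruction from the descriptions above, not the paper's own notation: y^- and y^+ denote the hallucinated and factual responses, theta^- and theta^+ the comparator parameters, e(x) the input embeddings, and the normalised-gradient perturbation with step size epsilon is an assumed (FGM-style) choice.

```latex
% Hallucinatory comparator: token-level cross-entropy on hallucinated responses.
\mathcal{L}_{\mathrm{hal}}(\theta^{-})
  = -\sum_{t=1}^{|y^{-}|} \log p_{\theta^{-}}\!\left(y^{-}_{t} \,\middle|\, e(x),\, y^{-}_{<t}\right)

% Truthful comparator: cross-entropy on factual responses under a perturbed input.
\mathcal{L}_{\mathrm{tru}}(\theta^{+})
  = -\sum_{t=1}^{|y^{+}|} \log p_{\theta^{+}}\!\left(y^{+}_{t} \,\middle|\, e(x) + \delta,\, y^{+}_{<t}\right)

% Assumed perturbation: normalised gradient of the loss on the hallucinated response.
\delta = \epsilon \, \frac{g}{\lVert g \rVert_{2}}, \qquad
g = \nabla_{e(x)} \left[ -\sum_{t=1}^{|y^{-}|} \log p_{\theta^{+}}\!\left(y^{-}_{t} \,\middle|\, e(x),\, y^{-}_{<t}\right) \right]
```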
Adaptation: LoRA (Low-Rank Adaptation) with Mixture of Experts
Training Data:
Multi-task dataset containing pairs of (instruction, hallucinated response) and (instruction, factual response).
Tasks include Question Answering, Dialogue, Summarization.
Compute: Not explicitly reported in the paper; the method presumably incurs standard LoRA fine-tuning costs.
Comparison to Prior Work
vs. DoLa: CDT uses external fine-tuned comparators rather than internal layer contrasting, allowing for more specific attribute control.
vs. ICD: CDT uses both a hallucinatory AND a truthful comparator (ICD only uses a penalty model), and employs Mixture-of-Experts to handle diverse tasks.
vs. CAD: CDT focuses on instruction-based comparators rather than just context masking.
Limitations
Requires constructing and fine-tuning two separate comparator models, which adds training overhead compared to inference-only methods like DoLa.
Inference cost is increased because it requires forward passes through the target model and both comparators.
Effectiveness depends on the quality and diversity of the multi-task hallucination dataset used for training comparators.
Code is publicly available at https://github.com/ydk122024/CDT. The paper describes the method mathematically but lacks a detailed hyperparameter table (e.g., learning rates, LoRA rank) in the main text.
📊 Experiments & Results
Evaluation Setup
Evaluate response factuality and quality on diverse downstream tasks.
Wizard of Wikipedia (WoW) (Knowledge-grounded Dialogue)
HaluEval (Hallucination Evaluation)
Metrics:
ROUGE-L
BERTScore
Factuality metrics (not specified in this summary)
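For reference, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below is a simplified version, assuming lowercased whitespace tokenisation, no stemming, and a balanced F1 combination of precision and recall; published results are normally computed with a standard ROUGE package, whose preprocessing may differ. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# The LCS here is "the cat on the mat" (5 tokens) out of 6 tokens on each side.
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```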
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
CDT improves performance on all evaluated tasks (Summarization, QA, Dialogue) relative to the base model and other decoding interventions.
The Mixture-of-Experts strategy is crucial for handling different types of hallucinations (e.g., intrinsic vs. faithful) that appear in different tasks.
The adversarial training mechanism for the truthful comparator effectively prevents overfitting and enhances robustness.
The framework maintains generation fluency while improving factuality, unlike some penalty-based methods that degrade coherence.
📚 Prerequisite Knowledge
Prerequisites
Understanding of autoregressive decoding and logits
Familiarity with Low-Rank Adaptation (LoRA)
Basic knowledge of contrastive decoding techniques
Key Terms
CDT: Comparator-driven Decoding-Time framework—the proposed method leveraging truthful and hallucinatory comparators during inference.
PME: Prototype-guided Mixture of Experts—a strategy to route inputs to specific LoRA experts based on instruction prototypes.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.
Hallucinatory Comparator: A version of the base model fine-tuned on hallucinated responses to model non-factual generation patterns.
Truthful Comparator: A version of the base model fine-tuned on factual responses (with adversarial perturbations) to model truthful patterns.
Instruction Prototype: A learned representative feature vector for a cluster of similar task instructions, used to guide the routing of inputs to experts.
Contrastive Decoding: A decoding strategy that modifies the next-token probability distribution by contrasting the logits of a strong model against a weak or amateur model.