Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

📝 Paper Summary

Hallucination suppression · Preference-based alignment

F-DPO modifies the Direct Preference Optimization objective to prioritize factual correctness over fluency by flipping mislabeled preference pairs and adding a margin term that demands a larger reward gap when a factual chosen response is paired with a hallucinated rejected one.

Core Problem

Standard preference alignment methods like DPO reinforce hallucinations because annotators often prefer fluent, confident, but factually incorrect responses over concise, accurate ones.

Why it matters:

High-stakes domains (medicine, law) require strict adherence to truth, but standard alignment prioritizes style
Existing solutions require complex auxiliary reward models, multi-stage training, or fine-grained token-level supervision
Preference datasets contain inherent noise where human or model judges systematically misrank hallucinated responses as better than factual ones

Concrete Example: When asked 'What is the capital of Australia?', annotators may prefer the incorrect but confident 'The capital is Sydney, its largest city' over the correct but simple 'Canberra'. Standard DPO reinforces the Sydney answer; F-DPO detects the factuality gap and flips the preference label.

Key Novelty

Factuality-aware Direct Preference Optimization (F-DPO)

Augment preference pairs with binary factuality labels (factual vs. hallucinated) derived from an automated judge
Apply 'Label Flipping': If a hallucinated response is preferred over a factual one, swap the preference direction so the model learns to choose the factual response
Apply 'Factuality-Conditioned Margin': Add a margin term λ(h_l - h_w) to the loss so that pairs whose responses differ in factuality must be separated by a larger reward gap than ordinary preference pairs
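The label-flipping rule can be illustrated with a minimal Python sketch. The `PreferencePair` class and `flip_if_misordered` function are hypothetical names introduced here for illustration, not the authors' code; the 0 = factual, 1 = hallucinated convention follows the Factuality Judge description in this summary.

```python
from dataclasses import dataclass, replace

# Label convention taken from the summary: 0 = factual, 1 = hallucinated.
FACTUAL, HALLUCINATED = 0, 1


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one annotated preference pair."""
    prompt: str
    chosen: str      # y_w, the response the annotator preferred
    rejected: str    # y_l, the response the annotator rejected
    h_chosen: int    # h_w, factuality label of the chosen response
    h_rejected: int  # h_l, factuality label of the rejected response


def flip_if_misordered(pair: PreferencePair) -> PreferencePair:
    """Swap chosen and rejected only when a hallucinated response beat a factual one."""
    if pair.h_chosen == HALLUCINATED and pair.h_rejected == FACTUAL:
        return replace(
            pair,
            chosen=pair.rejected,
            rejected=pair.chosen,
            h_chosen=pair.h_rejected,
            h_rejected=pair.h_chosen,
        )
    return pair  # all other label combinations are left untouched


noisy = PreferencePair(
    prompt="What is the capital of Australia?",
    chosen="The capital is Sydney, its largest city",
    rejected="Canberra",
    h_chosen=HALLUCINATED,
    h_rejected=FACTUAL,
)
fixed = flip_if_misordered(noisy)
print(fixed.chosen)  # Canberra
```

Applying the function twice leaves the pair unchanged, since a flipped pair no longer satisfies the swap condition.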

Architecture

Overview of F-DPO method contrasting it with standard DPO.

Evaluation Highlights

Reduces hallucination rate by 5x on Qwen3-8B (from 0.424 to 0.084) compared to the base model
Achieves 0.008 hallucination rate on Qwen2.5-14B, nearly an order of magnitude improvement over the base model
Improves TruthfulQA MC2 accuracy by +49% relative (0.357 to 0.531, an absolute gain of 0.174) on Qwen2.5-14B, showing strong out-of-distribution generalization
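The headline ratios above can be checked from the reported numbers. The short Python sketch below uses only the figures quoted in this summary; the helper names are illustrative.

```python
def fold_reduction(before: float, after: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


def relative_gain(before: float, after: float) -> float:
    """Relative improvement, expressed as a fraction of the starting value."""
    return (after - before) / before


# Qwen3-8B hallucination rate: 0.424 -> 0.084
print(round(fold_reduction(0.424, 0.084), 2))       # 5.05

# Qwen2.5-14B TruthfulQA MC2: 0.357 -> 0.531
print(round(relative_gain(0.357, 0.531) * 100, 1))  # 48.7
```

The 48.7% figure rounds to the +49% quoted, which confirms that the MC2 improvement is stated in relative rather than absolute terms.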

Breakthrough Assessment

8/10

Simple yet highly effective modification to DPO that solves a critical alignment failure mode (style-over-substance) without auxiliary reward models or complex pipelines.

⚙️ Technical Details

Problem Definition

Setting: Preference optimization where preference pairs (y_w, y_l) may be misaligned with factuality labels (h_w, h_l)

Inputs: Prompt x, response pair (y_w, y_l), binary factuality labels (h_w, h_l)

Outputs: Optimized policy π_θ

Pipeline Flow

Data Augmentation (Generate synthetic hallucinations)
Factuality Annotation (Label responses with binary judges)
Label Flipping (Correct misordered pairs)
F-DPO Training (Optimize policy with margin penalty)

System Modules

Factuality Judge

Assign binary factuality labels (0=factual, 1=hallucinated) to responses

Model or implementation: GPT-4o-mini

Policy Model

Language model being optimized to generate factual responses

Model or implementation: Various (Llama-3, Qwen, and Gemma-2 families)

Novel Architectural Elements

Factuality-aware loss function modification that dynamically adjusts the margin based on binary factuality labels

Modeling

Base Model: Evaluated on 7 models including Llama-3.2-1B, Gemma-2-2B, Qwen2-7B, Qwen3-8B, Llama-3-8B, Gemma-2-9B, Qwen2.5-14B

Training Method: F-DPO (Factuality-aware Direct Preference Optimization)

Objective Functions:

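Purpose: Maximize likelihood of preferred response while penalizing hallucinations.

Formally: $\mathcal{L}_{\text{F-DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \lambda (h_l - h_w)\right)$

Purpose: Ensure the chosen response is never less factual than the rejected one.

Formally: If h_w=1 and h_l=0, swap y_w and y_l, and swap h_w and h_l to match.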
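The objective above can be sketched for a single pair in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the log-probabilities are assumed to be already summed over response tokens, and the defaults for `beta` and `lam` are the hyperparameter values listed in this summary.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def f_dpo_loss(
    logp_w: float, ref_logp_w: float,   # policy / reference log-prob of y_w
    logp_l: float, ref_logp_l: float,   # policy / reference log-prob of y_l
    h_w: int, h_l: int,                 # factuality labels: 0 = factual, 1 = hallucinated
    beta: float = 0.1,
    lam: float = 100.0,
) -> float:
    """Per-pair loss: DPO reward gap minus a factuality-conditioned margin."""
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin = lam * (h_l - h_w)  # non-zero only when the labels differ
    return -log_sigmoid(reward_gap - margin)


# At initialization the policy equals the reference, so the reward gap is 0.
print(round(f_dpo_loss(-5.0, -5.0, -7.0, -7.0, h_w=0, h_l=0), 4))  # 0.6931
print(round(f_dpo_loss(-5.0, -5.0, -7.0, -7.0, h_w=0, h_l=1), 4))  # 100.0
```

With equal labels the expression reduces to the standard DPO loss (log 2 at initialization). With a factual chosen response and a hallucinated rejected one, the sigmoid argument starts at -100 for the listed values, so those pairs contribute a much larger loss until the policy separates them.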

Training Data:

Skywork Reward-Preference corpus (80K pairs)
Augmented with synthetic hallucinations generated by LLM
Final dataset: 45K pairs after balancing and filtering

Key Hyperparameters:

beta: 0.1
lambda: 100 (factuality penalty strength)
learning_rate: 5e-7
batch_size: 128
max_length: 2048
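For reference, the listed values could be collected in a small configuration object. The `FDPOConfig` class below is a hypothetical container, not part of any library; it holds only the hyperparameters stated in this summary, and anything not listed is omitted rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FDPOConfig:
    """Hypothetical container for the hyperparameters listed in this summary."""
    beta: float = 0.1            # DPO temperature on the log-ratio terms
    lam: float = 100.0           # factuality penalty strength (lambda)
    learning_rate: float = 5e-7
    batch_size: int = 128
    max_length: int = 2048       # maximum sequence length in tokens (assumed unit)


cfg = FDPOConfig()
print(cfg.lam / cfg.beta)  # log-ratio gap at which the margin is exactly offset
```

The final line shows the scale implied by these two values: the margin term is offset only once the policy's log-ratio gap between chosen and rejected responses reaches lambda / beta.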

Compute: Conducted on a GPU cluster (specific hours not reported)

Comparison to Prior Work

vs. Standard DPO: F-DPO adds label flipping and factuality margin to correct preference noise
vs. SafeDPO: Targets factuality instead of safety; explicitly flips mislabeled pairs unlike SafeDPO which focuses on margin constraints
vs. Mask-DPO: Single-stage response-level optimization without fine-grained token-level annotations
vs. FLAME [not cited in paper]: Single-stage preference learning rather than multi-stage SFT and RL pipelines

Limitations

Relies on binary factuality labels, which may be too coarse for nuanced errors
Requires an automated judge (GPT-4o-mini) for data annotation, inheriting its biases
Trade-off between absolute performance and data efficiency when removing hallucinated-hallucinated pairs
Evaluated primarily on QA tasks; applicability to creative writing or reasoning unclear

Reproducibility

Code website mentioned but URL not explicitly provided in text. Data construction pipeline uses GPT-4o-mini. Detailed hyperparameters provided in Appendix.

📊 Experiments & Results

Evaluation Setup

Factuality evaluation using LLM-as-judge and benchmarks

Benchmarks:

Held-out Skywork Subset (Preference ranking / Factuality scoring)
TruthfulQA (Multiple-choice and Generation)

Metrics:

Factuality Score (0-10 judge)
Hallucination Rate (fraction of responses with a judge score < 5)
Win Rate
MC1/MC2 Accuracy (TruthfulQA)
Statistical methodology: Not explicitly reported in the paper
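The two judge-based metrics can be computed from a list of per-response scores, as in the Python sketch below. The scores are made-up illustrative values, not results from the paper, and the function names are hypothetical. Win Rate is left out because this summary does not define how it is computed.

```python
from statistics import mean
from typing import Sequence


def hallucination_rate(scores: Sequence[float], threshold: float = 5.0) -> float:
    """Fraction of responses whose judge score falls below the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s < threshold for s in scores) / len(scores)


def factuality_score(scores: Sequence[float]) -> float:
    """Mean judge score on the 0-10 scale."""
    return mean(scores)


# Illustrative judge scores for eight responses (not from the paper).
scores = [9, 3, 7, 4, 8, 10, 2, 6]
print(hallucination_rate(scores))  # 0.375
print(factuality_score(scores))    # 6.125
```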

Key Results

F-DPO consistently reduces hallucination rates across all model sizes compared to base models and standard DPO.
Benchmark	Metric	Baseline	This Paper	Δ
Held-out Skywork	Hallucination Rate	0.424	0.084	-0.340
Held-out Skywork	Hallucination Rate	0.418	0.084	-0.334
Held-out Skywork	Factuality Score	5.26	7.90	+2.64

Out-of-distribution evaluation on TruthfulQA shows F-DPO generalizes better than baselines.
Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA	MC1 Accuracy	0.500	0.585	+0.085
TruthfulQA	MC2 Accuracy	0.357	0.531	+0.174
TruthfulQA	MC1 Accuracy	0.472	0.585	+0.113

Experiment Figures

Impact of factuality penalty strength (lambda) on reward margins across models.

Dual-axis plot for Qwen2.5-14B showing Factuality Score and Win Rate as a function of lambda.

Main Takeaways

Standard DPO frequently degrades factuality because the preference data itself often favors fluent but hallucinated responses, and DPO rewards whatever the data prefers.
Label flipping and margin penalties are complementary: flipping corrects data noise, while margin penalties amplify the signal for factual responses.
The method is highly data-efficient: it reaches 95% of peak performance with only 25% of the training data.
Larger models (e.g., Qwen2.5-14B) benefit more from F-DPO, likely because knowledge stored in their parameters is easier to retrieve.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO) loss function
Reinforcement Learning from Human Feedback (RLHF) concepts
Bradley-Terry preference model

Key Terms

DPO: Direct Preference Optimization—an alignment method optimizing policy directly on preference pairs without a separate reward model

RLHF: Reinforcement Learning from Human Feedback—training models to align with human goals using reward models and policy optimization

Hallucination: Generated content that is fluent and confident but factually incorrect

SFT: Supervised Fine-Tuning—initial training phase on high-quality instruction-response pairs

MC1/MC2: Multiple-choice metrics from the TruthfulQA benchmark; MC1 measures whether the model assigns its highest probability to the single best answer, MC2 measures the normalized probability mass assigned to the set of true answers

Label Flipping: A mechanism that swaps the 'winner' and 'loser' labels in a preference pair if the original winner is hallucinated and the loser is factual
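As an illustration of the MC1/MC2 entry above, the following Python sketch shows one common way these scores are computed for a single question from per-answer log-probabilities. The function names and the example numbers are hypothetical, and the exact scoring script used in the paper is not described in this summary.

```python
import math
from typing import Sequence


def mc1(logprobs: Sequence[float], best_index: int) -> float:
    """1.0 if the single best answer gets the highest log-probability, else 0.0."""
    top = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return float(top == best_index)


def mc2(logprobs: Sequence[float], is_true: Sequence[bool]) -> float:
    """Share of total probability mass that lands on the true answers."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)


# Illustrative question with three candidate answers (made-up probabilities).
logprobs = [math.log(0.5), math.log(0.3), math.log(0.2)]
is_true = [True, False, True]

print(mc1(logprobs, best_index=0))       # 1.0
print(round(mc2(logprobs, is_true), 3))  # 0.7
```

Benchmark-level MC1 and MC2 are then the averages of these per-question values.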