Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

📝 Paper Summary

Hallucination suppression | Preference Learning / Alignment

Mask-DPO improves LLM factuality by using sentence-level annotations to mask out incorrect sentences in preferred responses and correct sentences in non-preferred responses during Direct Preference Optimization training.

Core Problem

Standard DPO optimizes at the response level, meaning it inadvertently encourages hallucinations present in 'preferred' responses and penalizes factual statements present in 'non-preferred' responses.

Why it matters:

LLM responses often contain a mix of truth and hallucination, creating ambiguity when an entire response is labeled simply as 'better' or 'worse'
This noise in the training signal limits the effectiveness of alignment, preventing models from distinguishing fine-grained factual nuances
Existing methods rely on coarse response-level rewards, which fail to target the specific sentences responsible for hallucinations

Concrete Example: If a preferred response contains 9 correct sentences and 1 hallucination, vanilla DPO increases the probability of the hallucination along with the truth. Conversely, if a rejected response has mostly errors but 1 correct fact, DPO decreases the probability of that correct fact.

Key Novelty

Mask-DPO (Masked Direct Preference Optimization)

Introduces a masking mechanism into the DPO loss function that leverages sentence-level factuality annotations
Prevents the model from learning from incorrect sentences within 'winning' samples by masking their contribution to the loss
Prevents the model from being penalized for correct sentences within 'losing' samples, resolving the ambiguity of mixed-quality responses
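The masking rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name build_token_mask, the per-sentence token counts, and the 0/1 encoding are all assumptions introduced here. It expands sentence-level factuality labels into a per-token mask, keeping factual sentences in a preferred response and hallucinated sentences in a non-preferred one.

```python
from typing import List


def build_token_mask(tokens_per_sentence: List[int],
                     sentence_is_factual: List[bool],
                     preferred: bool) -> List[int]:
    """Expand sentence-level labels into a per-token 0/1 mask (hypothetical helper).

    preferred=True:  keep (1) tokens of factual sentences, drop hallucinated ones.
    preferred=False: keep (1) tokens of hallucinated sentences, drop factual ones.
    """
    if len(tokens_per_sentence) != len(sentence_is_factual):
        raise ValueError("need exactly one factuality label per sentence")
    mask: List[int] = []
    for n_tokens, is_factual in zip(tokens_per_sentence, sentence_is_factual):
        keep = is_factual if preferred else not is_factual
        mask.extend([1 if keep else 0] * n_tokens)
    return mask


# Preferred response: three sentences, the second one is hallucinated.
mask_w = build_token_mask([4, 3, 2], [True, False, True], preferred=True)
# Non-preferred response: two sentences, the first one is actually correct.
mask_l = build_token_mask([2, 3], [True, False], preferred=False)
print(mask_w)  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
print(mask_l)  # [0, 0, 1, 1, 1]
```

In the preferred response the hallucinated sentence contributes nothing to the loss, and in the non-preferred response the correct sentence is not pushed down.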

Architecture

Conceptual comparison between standard DPO and Mask-DPO. Standard DPO optimizes the entire preferred response (green) and suppresses the entire rejected response (red). Mask-DPO applies a mask based on sentence-level factuality.

Evaluation Highlights

Improves Llama3.1-8B-Instruct on the ANAH test set from 49.19% to 77.53%, surpassing the larger Llama3.1-70B-Instruct (53.44%)
Outperforms standard DPO (68.44%) and FactTune (56.83%) on in-domain factuality benchmarks
Generalizes to out-of-domain biography generation, improving FactScore from 30.29% to 39.39% without training on biography data

Breakthrough Assessment

8/10

Offers a simple yet effective modification to DPO that addresses a fundamental flaw in response-level preference learning for factuality. On the ANAH test set, the aligned 8B model outperforms the much larger Llama3.1-70B-Instruct (77.53% vs. 53.44%).

⚙️ Technical Details

Problem Definition

Setting: Fine-grained factuality alignment using pairwise preference data with sentence-level supervision

Inputs: Prompt x and a pair of responses (y_w, y_l) where y_w is generally more factual than y_l

Outputs: Optimized policy model π_θ that minimizes hallucinations

Pipeline Flow

Data Construction: Generate candidate responses -> Annotate sentence-level factuality -> Construct pairs
Mask-DPO Training: Optimize policy using masked DPO objective

System Modules

Hallucination Annotator

Annotate each sentence in candidate responses as factual or hallucinated

Model or implementation: ANAH-v2

Policy Model

Generate responses; optimized to maximize likelihood of factual segments

Model or implementation: Llama3.1-8B-Instruct

Novel Architectural Elements

Modification of DPO loss function to include a binary mask M that filters token-level gradients based on sentence correctness

Modeling

Base Model: Llama3.1-8B-Instruct

Training Method: Mask-DPO (Masked Direct Preference Optimization)

Objective Functions:

Purpose: Optimize policy to prefer factual segments while ignoring ambiguous signals.

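Formally: L_Mask-DPO = -E[log σ(β * (M_w(log π_θ(y_w|x) - log π_ref(y_w|x)) - M_l(log π_θ(y_l|x) - log π_ref(y_l|x))))]

Here M_w keeps only the tokens of y_w that belong to sentences annotated as factual, M_l keeps only the tokens of y_l that belong to sentences annotated as hallucinated, π_ref is the reference model, and β is the standard DPO scaling parameter.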
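A minimal sketch of this objective for a single preference pair, under one reading of the formula: the masks are applied token by token to the policy-versus-reference log-ratio, and the masked terms are then summed. The function names, the plain-list inputs, and beta=0.1 are illustrative assumptions; the summary notes that the paper does not report beta.

```python
import math
from typing import Sequence


def masked_log_ratio(policy_logps: Sequence[float],
                     ref_logps: Sequence[float],
                     mask: Sequence[int]) -> float:
    """Sum of per-token (log pi_theta - log pi_ref), counting only tokens with mask == 1."""
    if not (len(policy_logps) == len(ref_logps) == len(mask)):
        raise ValueError("log-prob sequences and mask must have the same length")
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def mask_dpo_loss(pol_w, ref_w, mask_w, pol_l, ref_l, mask_l, beta: float = 0.1) -> float:
    """Masked DPO loss for one (preferred, non-preferred) pair; beta is a placeholder value."""
    margin = beta * (masked_log_ratio(pol_w, ref_w, mask_w)
                     - masked_log_ratio(pol_l, ref_l, mask_l))
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy per-token log-probabilities (made-up numbers for illustration only).
loss = mask_dpo_loss(
    pol_w=[-1.0, -0.5, -2.0], ref_w=[-1.2, -0.9, -1.0], mask_w=[1, 1, 0],
    pol_l=[-0.8, -1.5], ref_l=[-0.8, -1.0], mask_l=[0, 1],
)
print(loss)
```

Because the third token of the preferred response is masked out, changing its log-probability leaves the loss unchanged, which is the intended effect of masking a hallucinated sentence.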

Training Data:

Training set: 8046 questions from ANAH-v2 dataset
Top-k sampling (K candidates) from policy model
Annotated by ANAH-v2 to create preference pairs based on factuality score ratio
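The pair-construction step can be sketched as below. The summary only states that pairs are built from a factuality score ratio, so the specific rule used here (take the highest-scoring candidate as preferred, the lowest-scoring as non-preferred, and skip ties) is an assumption, as are the names factual_ratio and build_pair and the (text, labels) tuple layout.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, List[bool]]  # (response text, per-sentence "is factual" labels)


def factual_ratio(labels: List[bool]) -> float:
    """Fraction of sentences labeled factual; 0.0 for an empty response."""
    return sum(labels) / len(labels) if labels else 0.0


def build_pair(candidates: List[Candidate]) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (preferred, non_preferred), or None if no usable contrast exists.

    Assumed rule: highest factual ratio wins, lowest loses, ties are discarded.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: factual_ratio(c[1]))
    worst, best = ranked[0], ranked[-1]
    if factual_ratio(best[1]) == factual_ratio(worst[1]):
        return None  # all candidates are equally factual: no preference signal
    return best, worst


candidates = [
    ("response A", [True, True, False]),
    ("response B", [True, False, False]),
    ("response C", [True, True, True]),
]
preferred, non_preferred = build_pair(candidates)
print(preferred[0], non_preferred[0])  # response C response B
```

The sentence-level labels are kept alongside each response so they can later be turned into masks for training.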

Key Hyperparameters:

beta: Not explicitly reported in the paper (standard DPO parameter)
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. FactTune: FactTune optimizes the whole response, while Mask-DPO masks out incorrect sentences in winners and correct sentences in losers
vs. Vanilla DPO: Mask-DPO introduces sentence-level supervision to resolve ambiguity in mixed-quality pairs

Limitations

Reliance on the quality of the hallucination annotator (ANAH-v2); if its sentence-level annotations are wrong, the masks will be wrong
Experiments limited to Llama3.1-8B-Instruct as the base model
Scaling laws investigated only for topic/question count, not model size

Reproducibility

Code: https://github.com/open-compass/ANAH

ANAH-v2 data is used for training. Hyperparameters such as learning rate and beta are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Factuality alignment on in-domain (ANAH) and out-of-domain (Biography) datasets

Benchmarks:

ANAH-v2 Test Set (generative QA, in-domain)
Biography (biography generation, out-of-domain)

Metrics:

ANAH-v2 Score (ratio of non-hallucinated sentences)
FactScore (percentage of correct atomic facts)
Statistical methodology: reported values are the mean over five replications
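As a rough illustration of the first metric and the averaging protocol, the sketch below computes the percentage of non-hallucinated sentences for one evaluation run and then averages over runs. Pooling sentences across all responses (rather than averaging per-response ratios) is an assumption, and the function names and toy labels are hypothetical.

```python
from statistics import mean
from typing import List


def anah_v2_score(responses: List[List[bool]]) -> float:
    """Percentage of sentences labeled non-hallucinated, pooled over all responses."""
    labels = [is_factual for response in responses for is_factual in response]
    if not labels:
        raise ValueError("no sentences to score")
    return 100.0 * sum(labels) / len(labels)


def mean_over_runs(runs: List[List[List[bool]]]) -> float:
    """Average the per-run score across replications (five in the reported setup)."""
    return mean([anah_v2_score(run) for run in runs])


# One toy run: two responses, four sentences in total, three of them factual.
run = [[True, True, False], [True]]
print(anah_v2_score(run))  # 75.0
```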

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
In-domain performance on ANAH test set shows Mask-DPO significantly outperforming baselines and larger models.
ANAH-v2 Test Set	ANAH-v2 Score	49.19	77.53	+28.34
ANAH-v2 Test Set	ANAH-v2 Score	68.44	77.53	+9.09
ANAH-v2 Test Set	FactScore	22.67	25.56	+2.89
Out-of-domain performance on Biography dataset showing generalization capabilities.
Biography	FactScore	30.29	39.39	+9.10
Biography	FactScore	37.97	39.39	+1.42
Ablation study on the masking mechanism.
ANAH-v2 Test Set	ANAH-v2 Score	68.44	77.53	+9.09

Main Takeaways

Mask-DPO improves factuality over vanilla DPO (77.53 vs. 68.44 ANAH-v2 Score on the ANAH test set) by resolving training signal ambiguity in mixed-quality responses.
The method enables an 8B model (77.53%) to surpass Llama3.1-70B-Instruct (53.44%) on the ANAH test set.
Generalization is strong: alignment on one dataset (ANAH) improves factuality on unseen topics and out-of-domain tasks (Biography).
Scaling the number of topics in training data is more effective than scaling the number of questions per topic for improving generalization.
Sampling preference pairs from the policy model itself works better than sampling from a different model or using different contexts.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Reinforcement Learning from Human Feedback (RLHF)
Kullback–Leibler (KL) divergence

Key Terms

DPO: Direct Preference Optimization, an alignment method that optimizes a policy directly on preference pairs without training a separate reward model or running a reinforcement learning loop

Mask-DPO: The proposed method, which applies masks to the DPO objective so that incorrect parts of preferred answers and correct parts of rejected answers are ignored

FactScore: A metric that decomposes a generation into atomic facts and verifies what percentage are supported by a knowledge source (e.g., Wikipedia)

ANAH-v2: A fine-grained hallucination annotation model and dataset used to label sentence-level factuality

RLHF: Reinforcement Learning from Human Feedback, a generic framework for aligning models using rewards derived from human preferences

hallucination: Generated content that is nonsensical or unfaithful to the source/world knowledge

FactTune: A baseline method that uses DPO for factuality alignment but relies on response-level factuality scores rather than masked sentence-level optimization

policy model: The language model being trained to generate responses

reference model: The original version of the model before alignment, used to prevent the trained model from drifting too far (via KL penalty)