Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models

📝 Paper Summary

Factuality Alignment, Hallucination Mitigation, Preference Learning

APEFT improves the generalization of factual tuning by constructing fine-grained 'atomic' preference pairs—targeting individual facts rather than whole paragraphs—to address the under-alignment problem observed in out-of-domain settings.

Core Problem

Existing preference learning methods for factuality are primarily evaluated on in-domain data, but they fail to generalize to out-of-domain (OOD) queries, where performance often stagnates or decreases.

Why it matters:

Models tuned for factuality on one task (e.g., biographies) often fail to apply that factual behavior to other domains (e.g., general knowledge questions), limiting real-world utility
Standard paragraph-level feedback is too coarse, preventing models from learning specifically *which* facts are incorrect versus correct
Current methods exhibit 'under-alignment,' where the model's behavior barely changes on OOD inputs, rather than 'over-alignment' to spurious features

Concrete Example: A model trained to be factual on biographies (In-Domain) might still hallucinate when asked about general topics like 'What are the contributions of Albert Einstein?' (Out-of-Domain), showing no improvement over the base model because it hasn't learned the general principle of factuality.

Key Novelty

Atomic Preference Enhanced Factuality Tuning (APEFT)

Decomposes general responses into atomic sentences containing single facts to isolate specific errors
Constructs 'atomic preferences' by pairing the model's incorrect statement of a specific fact with a corrected version, using a knowledge detection prompt to confirm that the model 'knows' the fact but failed to state it
Combines these fine-grained atomic preferences with general paragraph-level preferences during training to teach the model to attend to individual factual claims
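The three steps above can be sketched as a small filter over atomic sentences. This is a minimal illustration, not the authors' code: the `PreferencePair` type, the `build_atomic_pairs` function, and the three callables (`is_supported`, `correct`, `model_knows`) are hypothetical stand-ins for the fact verifier, the correction step, and the knowledge detection prompt, and reusing the original prompt for each atomic pair is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # corrected, factual atomic sentence
    rejected: str  # the model's original, incorrect atomic sentence


def build_atomic_pairs(
    prompt: str,
    atomic_sentences: List[str],
    is_supported: Callable[[str], bool],         # hypothetical fact verifier
    correct: Callable[[str], Optional[str]],     # hypothetical correction step
    model_knows: Callable[[str], bool],          # hypothetical knowledge detection probe
) -> List[PreferencePair]:
    """Keep only errors on facts the model is judged to know."""
    pairs = []
    for sentence in atomic_sentences:
        if is_supported(sentence):
            continue  # already factual, nothing to contrast
        fixed = correct(sentence)
        if fixed is None or not model_knows(fixed):
            continue  # no correction available, or the model does not know the fact
        pairs.append(PreferencePair(prompt, chosen=fixed, rejected=sentence))
    return pairs


# Toy demo with hand-written stand-ins for the three components.
reference = {"Marie Curie won two Nobel Prizes."}
fixes = {"Marie Curie was born in Paris.": "Marie Curie was born in Warsaw."}
sentences = ["Marie Curie was born in Paris.", "Marie Curie won two Nobel Prizes."]

pairs = build_atomic_pairs(
    prompt="Tell me a bio of Marie Curie.",
    atomic_sentences=sentences,
    is_supported=lambda s: s in reference,
    correct=fixes.get,
    model_knows=lambda s: True,
)
```

In the demo only the birthplace sentence yields a pair; the supported sentence is skipped, and a probe that returns `False` would drop the pair entirely.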

Evaluation Highlights

APEFT improves factuality by an average of +3.45% across both In-Domain (Bio) and Out-of-Domain (FAVA, FPQA, KUQA) datasets compared to standard preference learning
On the OOD dataset FAVA, APEFT achieves a 51.5% win rate against the base model, compared with 44.6% for standard DPO, and also outperforms the other baselines
Token distribution analysis confirms APEFT increases the number of 'shifted tokens' on OOD data, effectively mitigating the under-alignment problem

Breakthrough Assessment

7/10

Solid contribution identifying 'under-alignment' as the cause of poor OOD factuality and proposing a logical, effective solution (atomic preferences). The gains are consistent, though the scope is limited to biography-based training.

⚙️ Technical Details

Problem Definition

Setting: Aligning LLMs to factuality using preference learning (DPO, KTO, etc.) with a focus on generalization to unseen domains

Inputs: Prompt x requiring a factual response

Outputs: Generated response y aligned with external knowledge

Pipeline Flow

Data Construction: Generate biographies -> Split into atomic facts -> Verify facts -> Construct Atomic Preferences
Training: Fine-tune model using Preference Learning (DPO/KTO/etc.) on mixed General + Atomic preferences
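The training step consumes a single dataset that mixes both kinds of preference pairs. A minimal sketch of that mixing follows; the `mix_preferences` name, the tuple layout, and the shuffling with a fixed seed are illustrative assumptions, since the summary does not describe the mixing ratio or ordering.

```python
import random
from typing import List, Tuple

# (prompt, chosen, rejected); a hypothetical layout for this sketch
Pair = Tuple[str, str, str]


def mix_preferences(general: List[Pair], atomic: List[Pair], seed: int = 0) -> List[Pair]:
    """Concatenate paragraph-level and atomic pairs, then shuffle reproducibly."""
    mixed = list(general) + list(atomic)
    random.Random(seed).shuffle(mixed)  # assumption: simple uniform shuffling
    return mixed


general_pairs = [("Write a bio of X.", "factual paragraph", "less factual paragraph")]
atomic_pairs = [("Write a bio of X.", "X was born in 1900.", "X was born in 1910.")]
train_set = mix_preferences(general_pairs, atomic_pairs)
```

Because the change is confined to the data, the resulting list can be handed to whichever preference learning algorithm is in use.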

System Modules

Preference Constructor

Create training data by decomposing responses and pairing factually correct vs. incorrect atomic statements

Model or implementation: Based on FActScore / Knowledge Detection

Factuality Tuner

Align the base model to prefer factual atomic statements

Model or implementation: LLaMA-3-8B-Instruct or LLaMA-2-7B-Chat

Novel Architectural Elements

Atomic Preference construction pipeline: Specifically targets the granularity of individual facts to enhance the signal for factuality alignment

Modeling

Base Model: LLaMA-3-8B-Instruct and LLaMA-2-7B-Chat

Training Method: Preference Learning (DPO, KTO, IPO, RSO, CPO)

Objective Functions:

Purpose: Optimize model policy to increase likelihood of preferred (factual) responses over dispreferred ones.

Formally: Varies by algorithm (e.g., DPO applies a logistic loss to the difference between the policy-to-reference log-probability ratios of the preferred and dispreferred responses).
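For reference, the standard DPO objective can be written as below. The notation is taken from the original DPO formulation rather than from this summary: $y_w$ and $y_l$ denote the preferred and dispreferred responses to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\beta$ is a temperature hyperparameter, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

With atomic preferences, $y_w$ and $y_l$ would be single-fact sentences instead of whole paragraphs; the loss itself is unchanged.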

Adaptation: Full parameter fine-tuning

Training Data:

Constructed 2777 preference pairs for LLaMA-3-8B-Instruct
Constructed 2730 preference pairs for LLaMA-2-7B-Chat
Source: Biography generation task (names drawn from a list of popular people)

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 4
gradient_accumulation_steps: 4
epochs: 3

Compute: 4 Nvidia A100-80G GPUs

Comparison to Prior Work

vs. Standard DPO/KTO/etc.: APEFT changes the *data* (using atomic preferences) rather than the loss function, and can be applied on top of any of these algorithms.
vs. Fact-tuning (Tian et al. 2023): APEFT focuses on OOD generalization and atomic-level granularity rather than just ID performance.

Limitations

Training data is restricted to the biography generation task, limiting the scope of 'source' domains explored.
Requires an external knowledge source or mechanism (FActScore) to verify atomic facts during data construction.
Performance gains, while consistent, are moderate (~3-4%).

Reproducibility

Code and datasets stated to be available after acceptance. Hyperparameters provided. Base models are open weights (LLaMA-2/3).

📊 Experiments & Results

Evaluation Setup

Train on Biography generation; Evaluate on In-Domain (Bio) and Out-of-Domain (FAVA, FPQA, KUQA) datasets.

Benchmarks:

Bio (In-Domain) (Biography Generation) [New]
FAVA (OOD) (Long-form factuality on open-ended topics)
FPQA (OOD) (Answering questions with false premises)
KUQA (OOD) (Short-form knowledge questions)

Metrics:

FActScore (factuality percentage)
Win Rate (for FAVA)
Exact Match / Accuracy (for FPQA/KUQA)
Statistical methodology: Not explicitly reported in the paper
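As a rough illustration of the factuality percentage, the sketch below scores a response as the share of its atomic facts that a verifier supports. It is a simplified reading of the metric, not the FActScore implementation: the `factuality_score` name and the `is_supported` callable are hypothetical, and details of the real metric beyond this ratio are omitted.

```python
from typing import Callable, List


def factuality_score(atomic_facts: List[str], is_supported: Callable[[str], bool]) -> float:
    """Percentage of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        return 0.0  # assumption: an empty response scores zero
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return 100.0 * supported / len(atomic_facts)


# Toy knowledge source: a set of accepted statements.
knowledge = {"fact A", "fact B"}
score = factuality_score(["fact A", "fact B", "fact C", "fact D"], lambda f: f in knowledge)
```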

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Initial benchmarking shows standard preference learning fails to generalize to OOD datasets.
KUQA	Accuracy	39.46	35.37	-4.09
FPQA	Accuracy	44.67	36.20	-8.47
APEFT consistently improves performance across ID and OOD datasets compared to standard tuning.
Average across 4 datasets	Factuality Score	Not reported in the paper	Not reported in the paper	+3.45
FAVA	Win Rate vs Base	44.6	51.5	+6.9
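The Δ column is the second score minus the first. A short check over the three rows that report both values (the variable names are illustrative):

```python
# (benchmark, metric, first score, second score) as listed in the table
rows = [
    ("KUQA", "Accuracy", 39.46, 35.37),
    ("FPQA", "Accuracy", 44.67, 36.20),
    ("FAVA", "Win Rate vs Base", 44.6, 51.5),
]

# Round to two decimals to avoid floating-point noise in the differences.
deltas = {name: round(after - before, 2) for name, _, before, after in rows}
```

The computed differences match the reported values of -4.09, -8.47, and +6.9.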

Experiment Figures

Frequency of shifted tokens (tokens with changed rank) on ID vs. OOD datasets.

Main Takeaways

Standard preference learning (DPO, KTO, etc.) often degrades or minimally improves factuality on Out-of-Domain (OOD) tasks compared to In-Domain (ID) tasks.
The primary cause of OOD failure is 'under-alignment'—the model changes its behavior too little—rather than 'over-alignment' to spurious features.
APEFT, by using atomic-level preferences, forces the model to attend to fine-grained factual details, significantly mitigating under-alignment.
Simply increasing the quantity or quality of general preference pairs does not necessarily lead to performance gains, highlighting the importance of preference granularity.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with Preference Learning / RLHF concepts
Knowledge of token probability distributions

Key Terms

hallucination: A phenomenon where LLMs generate seemingly convincing but factually erroneous responses

preference learning: Fine-tuning models using pairs of preferred (better) and dispreferred (worse) outputs to steer behavior

under-alignment: A failure mode where the tuning process is too superficial, causing no significant behavior change in out-of-domain settings

over-alignment: A failure mode where the model learns spurious features (e.g., style) rather than the intended task, leading to poor generalization

atomic preferences: Preference pairs constructed at the granularity of individual facts/sentences rather than entire paragraphs

FActScore: An automated metric that breaks generations into atomic facts and verifies each against a knowledge base

DPO: Direct Preference Optimization—a stable method for preference learning that optimizes a classification loss without a separate reward model

shifted tokens: Tokens whose probability rank changes significantly after fine-tuning compared to the base model, used as a proxy for behavioral change
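One way to count shifted tokens is to look up, at each position, the rank that the base model assigns to the token chosen by the tuned model. The sketch below is an assumption-laden illustration: the function names and the rank cutoff of 3 are hypothetical, since this summary only says the rank must change "significantly".

```python
from typing import Dict, List


def token_rank(probs: Dict[str, float], token: str) -> int:
    """1-based rank of `token` by descending probability (raises ValueError if absent)."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(token) + 1


def count_shifted(
    base_dists: List[Dict[str, float]],  # base-model distribution at each position
    tuned_tokens: List[str],             # token the tuned model emitted at each position
    rank_threshold: int = 3,             # hypothetical cutoff, not from the paper
) -> int:
    """Count positions where the tuned model's token ranks low under the base model."""
    return sum(
        1
        for probs, token in zip(base_dists, tuned_tokens)
        if token_rank(probs, token) > rank_threshold
    )


# Toy example: two positions with hand-written base-model distributions.
base_dists = [
    {"the": 0.6, "a": 0.2, "in": 0.1, "of": 0.06, "on": 0.04},
    {"born": 0.5, "was": 0.3, "is": 0.1, "lived": 0.07, "died": 0.03},
]
tuned_tokens = ["the", "died"]
num_shifted = count_shifted(base_dists, tuned_tokens)
```

Here the first token is the base model's top choice and is unshifted, while the second ranks fifth under the base model and counts as shifted. A low count on OOD inputs is the signature of under-alignment described above.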