FactAlign: Long-form Factuality Alignment of Large Language Models

📝 Paper Summary

Hallucination suppression
Alignment without Reinforcement Learning

FactAlign improves the factual accuracy of long-form responses by aligning models using fine-grained, sentence-level signals from an automatic factuality evaluator rather than just binary response-level feedback.

Core Problem

LLMs frequently hallucinate in long-form responses, and standard alignment methods (like RLHF or DPO) typically use coarse-grained response-level signals that fail to capture specific factual errors within a long text.

Why it matters:

Long-form generation makes factuality assessment complex, as a response can be partially correct and partially hallucinated
Current alignment methods often sacrifice helpfulness to improve factuality, or vice versa (the alignment tax)
Reliability is a crucial requirement for real-world adoption, but quantifying and improving long-form factuality remains non-trivial

Concrete Example: In a long biography of a scientist, a model might get the birth date right but hallucinate the university the scientist attended. A standard response-level reward must label the whole response either 'bad' (discarding the correct information) or 'good' (reinforcing the hallucination), whereas FactAlign specifically targets the sentence containing the wrong university.

Key Novelty

FactAlign with fKTO (fine-grained Kahneman-Tversky Optimization)

Extends the KTO alignment algorithm to operate at the sentence level (fKTO), treating each sentence as a mini-completion to be optimized based on its individual factual precision
Combines this fine-grained objective with a response-level objective to balance factual precision with overall helpfulness and recall
Utilizes an iterative self-training loop where the model generates responses, an automatic evaluator scores them, and the model is re-aligned on its own high-quality outputs

Architecture

The FactAlign framework workflow, illustrating the generation, evaluation, and multi-granularity alignment process.

Evaluation Highlights

+13.5% relative improvement in Factual F1 score on the LongFact-Concepts benchmark compared to the base model (Llama-3-8B-Instruct), consistent with the 36.2 → 41.1 row (+4.9 absolute) in Key Results
Outperforms standard DPO and KTO baselines on factual precision while maintaining or improving helpfulness (win-rate against base model)
Achieves superior factuality-helpfulness trade-offs compared to FactTune and other alignment baselines across open-domain and information-seeking tasks

Breakthrough Assessment

7/10

Solid methodology extending KTO to fine-grained signals. While it relies on existing evaluation pipelines (FactScore), the application to sentence-level alignment without pairwise preference data is a practical advancement for long-form factuality.

⚙️ Technical Details

Problem Definition

Setting: Aligning an LLM to generate long-form responses that maximize factual recall and precision with respect to a knowledge corpus (Wikipedia)

Inputs: User prompt x

Outputs: Long-form response y consisting of multiple sentences

Pipeline Flow

Response Generation (Model generates y given x)
Factuality Evaluation (Decomposition → Search → Assessment)
Alignment Training (Optimization via combined response-level and sentence-level KTO)
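The three stages above can be read as one iteration of a loop. The following is a minimal sketch under stated assumptions: `generate`, `evaluate`, and `align` are hypothetical placeholder callables supplied by the caller, not functions from the FactAlign repository, and the `Evaluation` dictionary layout is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical container for evaluator output, e.g. {"f1": 0.62}.
Evaluation = Dict[str, float]
Record = Tuple[str, str, Evaluation]


def run_iteration(
    prompts: List[str],
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Evaluation],
    align: Callable[[List[Record]], None],
) -> List[Record]:
    """One generate -> evaluate -> align pass over a batch of prompts."""
    records: List[Record] = []
    for x in prompts:
        y = generate(x)          # 1. Response generation
        scores = evaluate(x, y)  # 2. Factuality evaluation
        records.append((x, y, scores))
    align(records)               # 3. Alignment training on the scored data
    return records
```

Passing the stages in as callables keeps the sketch independent of any particular model or evaluator; repeating `run_iteration` with the updated policy would correspond to the iterative self-training described earlier.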

System Modules

Factuality Evaluator

Decompose response into atomic statements, retrieve evidence from Wikipedia, and assign binary supported/unsupported labels
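Once the evaluator has assigned binary labels, factual precision is simply the supported fraction of atomic statements. The sketch below assumes the labels are already available (decomposition and retrieval are not shown); the function name and the empty-list convention are assumptions made for illustration.

```python
from typing import List


def factual_precision(labels: List[bool]) -> float:
    """Fraction of atomic statements labeled as supported.

    Returns 0.0 for an empty list; that convention is an assumption
    of this sketch, not something the summary specifies.
    """
    if not labels:
        return 0.0
    return sum(labels) / len(labels)


# Hypothetical labels for four atomic statements from one response.
labels = [True, True, False, True]
print(factual_precision(labels))  # 0.75
```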

Model or implementation: GPT-3.5-Turbo (for decomposition and assessment)

Policy Model

Generate long-form responses; updated during training to minimize alignment loss

Model or implementation: Llama-3-8B-Instruct

Novel Architectural Elements

Sentence-level alignment objective: The loss function treats individual sentences as distinct optimization units (chosen/rejected) within the KTO framework based on their atomic factuality scores
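One way to picture sentence-level optimization units is to pair each sentence with the context that precedes it (the prompt plus earlier sentences). This is a rough sketch, not the authors' data pipeline: the `(context, sentence)` layout is an assumption, and the regex splitter is a naive stand-in for a real sentence segmenter.

```python
import re
from typing import List, Tuple


def sentence_units(prompt: str, response: str) -> List[Tuple[str, str]]:
    """Split a response into (context, sentence) pairs.

    The context is the prompt followed by all preceding sentences.
    """
    # Naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    units: List[Tuple[str, str]] = []
    history = prompt
    for sentence in sentences:
        units.append((history, sentence))
        history = history + " " + sentence
    return units


units = sentence_units(
    "Who was Ada Lovelace?",
    "She was a mathematician. She worked with Babbage.",
)
for context, sentence in units:
    print(repr(context), "->", repr(sentence))
```

Each unit can then carry its own chosen/rejected label, which is what lets a unary objective such as KTO operate below the response level.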

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: FactAlign (Iterative alignment using fKTO + KTO)

Objective Functions:

Purpose: Optimize response-level quality (helpfulness and overall factuality).
Formally: Standard KTO loss L_KTO on the full response y, labeled chosen if Factual F1 > threshold t.

Purpose: Optimize sentence-level factuality (fine-grained precision).
Formally: fKTO loss on individual sentences s_i, labeled chosen if average precision of atomic statements > threshold t_s.

Purpose: Combine coarse and fine signals.
Formally: L = L_KTO + lambda * L_fKTO
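The combination L = L_KTO + lambda * L_fKTO can be illustrated with scalars. The sketch below is a simplified stand-in rather than the authors' implementation: it uses a KTO-style per-example term of the form 1 - sigmoid(...), takes the reference point `z_ref` as a given number instead of estimating it, uses unit weights for chosen and rejected examples, and averages the sentence-level terms. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def kto_term(log_ratio: float, chosen: bool, z_ref: float = 0.0, beta: float = 0.1) -> float:
    """Simplified per-example KTO-style loss with unit class weights.

    log_ratio is log pi_theta(y|x) - log pi_ref(y|x). z_ref stands in for
    the reference point, which this sketch takes as given (an assumption).
    """
    if chosen:
        value = sigmoid(beta * (log_ratio - z_ref))
    else:
        value = sigmoid(beta * (z_ref - log_ratio))
    return 1.0 - value


def combined_loss(
    response: Tuple[float, bool],
    sentences: List[Tuple[float, bool]],
    lam: float = 1.0,
) -> float:
    """L = L_KTO + lambda * L_fKTO; sentence terms are averaged (an assumption)."""
    l_kto = kto_term(*response)
    if sentences:
        l_fkto = sum(kto_term(r, c) for r, c in sentences) / len(sentences)
    else:
        l_fkto = 0.0
    return l_kto + lam * l_fkto


# Hypothetical log-ratios: one response, two sentences (one chosen, one rejected).
print(combined_loss((2.0, True), [(3.0, True), (1.5, False)]))
```

Raising the log-ratio of a chosen unit lowers its term, and raising the log-ratio of a rejected unit increases it, which is the behavior both the response-level and sentence-level objectives rely on.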

Adaptation: Full fine-tuning (implied, LoRA not explicitly mentioned for final training)

Training Data:

Iterative generation: Model generates responses to prompts
Labeling: A response is labeled 'chosen' if its Factual F1 > 0.5 (response level); a sentence is labeled 'chosen' if the average precision of its atomic statements > 0.5 (sentence level)
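The thresholding rule can be sketched as two small functions. The strict `>` comparison follows the text above; treating everything at or below the threshold as 'rejected' is an assumption of this sketch, as are the function names.

```python
from typing import List


def label_response(factual_f1: float, t: float = 0.5) -> str:
    """Response-level label from the Factual F1 score."""
    return "chosen" if factual_f1 > t else "rejected"


def label_sentences(precisions: List[float], t_s: float = 0.5) -> List[str]:
    """Sentence-level labels from each sentence's average atomic precision."""
    return ["chosen" if p > t_s else "rejected" for p in precisions]


print(label_response(0.62))              # chosen
print(label_sentences([1.0, 0.5, 0.0]))  # ['chosen', 'rejected', 'rejected']
```

Note that a sentence scoring exactly 0.5 is not labeled 'chosen' under a strict comparison.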

Key Hyperparameters:

learning_rate: 5e-7
batch_size: 64
beta: 0.1
epochs: 1
lambda (sentence loss weight): 1.0
threshold_response (t): 0.5
threshold_sentence (t_s): 0.5

Compute: 8 H100 GPUs for training

Comparison to Prior Work

vs. FactTune: FactAlign uses KTO (unary labels) instead of DPO (pairs), allowing sentence-level optimization without needing to pair 'good' and 'bad' sentences
vs. Standard KTO: FactAlign introduces fKTO to leverage fine-grained sentence-level signals rather than just whole-response signals
vs. RLHF: FactAlign avoids the complexity of training a separate reward model and the instability of PPO (RLHF is implicit context here; the paper does not use it as a direct baseline)

Limitations

Relies on Wikipedia as the sole knowledge source, limiting applicability to topics not covered there
Dependent on the accuracy of the automatic factuality evaluator (GPT-3.5-based), which incurs cost and potential errors
Evaluation is expensive (roughly $4 per generation, according to prior work cited)
Does not explicitly model dependencies between sentences in the fine-grained loss (sentences treated as independent completions given history)

Reproducibility

Code: https://github.com/MiuLab/FactAlign

Code, datasets, and trained models are publicly available at https://github.com/MiuLab/FactAlign. The method uses GPT-3.5-Turbo for evaluation signals, representing a closed-source dependency and cost for reproduction.

📊 Experiments & Results

Evaluation Setup

Long-form generation on open-domain and information-seeking prompts, evaluated against Wikipedia knowledge

Benchmarks:

LongFact-Concepts (long-form factuality generation, Concepts subset)
FactScore-Bio (Biography generation)

Metrics:

Factual Precision (percentage of supported atomic statements)
Factual F1 (harmonic mean of precision and recall@K)
Response Length (number of atomic statements)
GPT-4 Win-rate (helpfulness assessment)
Statistical methodology: Not explicitly reported in the paper
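To make the Factual F1 metric concrete, the sketch below assumes the F1@K formulation commonly used with LongFact, where recall@K is the number of supported statements divided by K and capped at 1. The function name and the example counts are illustrative, not values from the paper.

```python
def factual_f1_at_k(supported: int, unsupported: int, k: int) -> float:
    """Harmonic mean of factual precision and recall@K.

    precision = supported / (supported + unsupported)
    recall@K  = min(supported / k, 1)
    Returns 0.0 when no statement is supported.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + unsupported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: a longer response with some errors versus a
# short response that is fully correct but says very little.
long_response = factual_f1_at_k(supported=30, unsupported=10, k=60)
short_response = factual_f1_at_k(supported=5, unsupported=0, k=60)
print(round(long_response, 3), round(short_response, 3))
```

With these counts the short, perfectly precise response scores lower than the longer one, which is why F1 penalizes models that improve precision only by saying less.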

Key Results

Main results on LongFact-Concepts showing FactAlign improves both factuality (F1) and helpfulness compared to the base model and standard alignment baselines.

Benchmark	Metric	Baseline	This Paper	Δ
LongFact-Concepts	Factual F1	36.2	41.1	+4.9
LongFact-Concepts	Factual F1	39.5	41.1	+1.6
LongFact-Concepts	Factual Precision	69.1	73.2	+4.1
FactScore-Bio	Factual Precision	74.7	83.5	+8.8
FactScore-Bio	Number of Facts (Recall proxy)	50.1	53.2	+3.1

Ablation studies demonstrating the specific contribution of the sentence-level fKTO loss.

Benchmark	Metric	Baseline	This Paper	Δ
LongFact-Concepts	Factual F1	39.6	41.1	+1.5
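The Δ column reports absolute differences. As a quick arithmetic check, the sketch below recomputes the absolute and relative change for the rows as transcribed in Key Results; the helper name is hypothetical, and the numbers are copied from the table rather than independently verified against the paper.

```python
from typing import Tuple


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent), rounded to 1 decimal."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


rows = [
    ("LongFact-Concepts", "Factual F1", 36.2, 41.1),
    ("LongFact-Concepts", "Factual F1", 39.5, 41.1),
    ("LongFact-Concepts", "Factual Precision", 69.1, 73.2),
    ("FactScore-Bio", "Factual Precision", 74.7, 83.5),
    ("FactScore-Bio", "Number of Facts", 50.1, 53.2),
]
for benchmark, metric, baseline, ours in rows:
    print(benchmark, metric, deltas(baseline, ours))
```

For the first row, the absolute gain of 4.9 points corresponds to a relative gain of about 13.5%.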

Experiment Figures

A motivating example comparing a Base Model response vs. a FactAlign response.

Main Takeaways

FactAlign improves factual F1 by encouraging the model to generate more correct facts rather than just shortening responses to minimize errors (a common failure mode of precision-only optimization).
Fine-grained (sentence-level) alignment via fKTO provides superior signals compared to coarse (response-level) alignment, allowing the model to distinguish factual from non-factual parts of a single response.
The method maintains or improves general helpfulness (GPT-4 win rate) while improving factuality, mitigating the 'alignment tax' often observed where factual models become terse or unhelpful.
Iterative training is effective: the model improves by training on its own high-quality generations filtered by the factuality evaluator.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model alignment (RLHF, DPO)
Familiarity with atomic fact decomposition and verification
Basic knowledge of loss functions for preference optimization

Key Terms

KTO: Kahneman-Tversky Optimization, an alignment loss function that uses binary 'chosen/rejected' labels per sample rather than requiring pairs of preferred/dispreferred responses

atomic statement: The smallest indivisible unit of information in a sentence that can be independently verified as true or false

fKTO: Fine-grained KTO, the authors' proposed algorithm that applies KTO loss to individual sentences based on their specific factuality scores

FactScore: An evaluation metric that decomposes long text into atomic facts and verifies each against a knowledge base (like Wikipedia) to calculate precision

Factual F1: A metric balancing factual precision (accuracy of statements) and recall (quantity of correct information), punishing models that are accurate but say very little

self-contained statement: A rewritten version of a sentence or clause where pronouns are resolved (e.g., 'He went there' -> 'Obama went to Harvard') so it can be verified independently