Flame: Factuality-aware alignment for LLMs

📝 Paper Summary

Knowledge internalization (post-training) | Hallucination suppression

Flame improves LLM factuality by training on the model's own generated knowledge rather than on human-written or RAG-generated data containing facts the model does not know, which avoids the hallucination induced by forcing a model to recite unfamiliar facts.

Core Problem

Standard alignment (SFT + RLHF) often degrades factuality because it forces models to learn from human or RAG-generated data containing information unknown to the pre-trained model.

Why it matters:

Fine-tuning on new/unknown knowledge inadvertently encourages hallucination by teaching the model to make up information it doesn't actually 'know'
Existing RL reward models prioritize helpfulness and length over factuality, often preferring detailed but fabricated responses

Concrete Example: A pilot study on biography generation shows that fine-tuning a standard LLM on high-quality biographies generated by a Retrieval-Augmented (RAG) teacher makes the student model hallucinate *more* than the baseline, because the RAG teacher's knowledge is external to the student.

Key Novelty

Factuality-Aware Alignment (Flame)

Identifying 'fact-based' instructions via a classifier to apply specialized training only where needed
Constructing SFT data using the model's own generated responses (distilled from few-shot prompting) rather than human gold data to prevent training on unknown knowledge
Employing a specialized factuality reward model during DPO that decomposes responses into atomic facts and verifies them using retrieval

Evaluation Highlights

+5.6 FActScore improvement on the Biography generation task compared to standard alignment (SFT+DPO)
Maintains strong instruction-following capability (51.2% win rate on Alpaca Eval) while significantly reducing hallucinations
Demonstrates that training on RAG-generated data (usually considered 'higher quality') actually hurts the factuality of non-RAG models

Breakthrough Assessment

7/10

Provides a counter-intuitive but crucial insight: better training data (RAG) can worsen model factuality if the model doesn't know the underlying facts. The proposed solution is effective and practical.

⚙️ Technical Details

Problem Definition

Setting: Aligning a pre-trained Large Language Model to follow instructions while minimizing factual errors (hallucinations)

Inputs: Natural language instruction x

Outputs: Generated response y

Pipeline Flow

Instruction Classification (Fact-based vs. Non-fact-based)
SFT Data Construction (Human data for non-fact; Self-generated data for fact)
SFT Training
DPO Preference Pair Construction
DPO Training

System Modules

Instruction Classifier

Determines if an instruction requires factual accuracy

Model or implementation: SFT model (Llama-2-70B based)
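A prompt-based classifier of this kind can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `is_fact_based` helper, and the `llm` callable are all hypothetical, and a stub stands in for the SFT model.

```python
from typing import Callable

# Hypothetical prompt wording (an assumption, not taken from the paper).
CLASSIFIER_PROMPT = (
    "Does answering the following instruction require factual knowledge? "
    "Answer yes or no.\n"
    "Instruction: {instruction}\n"
    "Answer:"
)

def is_fact_based(instruction: str, llm: Callable[[str], str]) -> bool:
    """Return True if the model's answer to the classifier prompt starts with 'yes'."""
    answer = llm(CLASSIFIER_PROMPT.format(instruction=instruction))
    return answer.strip().lower().startswith("yes")

def stub_llm(prompt: str) -> str:
    """Toy stand-in for the SFT model: says 'yes' for biography-style prompts."""
    return " Yes" if "biography" in prompt.lower() else " No"

bio_is_fact = is_fact_based("Write a biography of Alan Turing.", stub_llm)
poem_is_fact = is_fact_based("Write a poem about autumn.", stub_llm)
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library.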

SFT Data Generator

Generates training responses using the model's own internal knowledge

Model or implementation: Pre-trained Llama-2-70B

Factuality Reward Model (RM_fact)

Evaluates the factual correctness of generated responses for DPO pairs

Model or implementation: Retrieval-augmented verification pipeline
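The scoring idea behind such a reward (decompose a response into atomic facts, verify each, report the supported fraction) can be sketched as below. The sentence-level decomposer and the set-lookup verifier are toy assumptions standing in for the LLM-based decomposition and retrieval-based verification the summary describes. The function names are hypothetical.

```python
from typing import Callable, List

def split_into_atomic_facts(response: str) -> List[str]:
    """Toy decomposer (assumption): treat each sentence as one atomic fact."""
    return [s.strip() for s in response.split(".") if s.strip()]

def fact_score(response: str, is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts that the verifier marks as supported."""
    facts = split_into_atomic_facts(response)
    if not facts:
        return 0.0
    return sum(1 for f in facts if is_supported(f)) / len(facts)

# Toy stand-in for retrieved evidence; a real verifier would query a retriever.
EVIDENCE = {
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
}

def toy_verifier(fact: str) -> bool:
    """Exact-match lookup against the toy evidence set."""
    return fact in EVIDENCE

# The second sentence is deliberately unsupported by the evidence set.
score = fact_score(
    "Marie Curie was born in Warsaw. Marie Curie was born in 1900.",
    toy_verifier,
)
```

Here one of the two facts is supported, so the score is 0.5.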

Aligned Model

Final aligned LLM for inference

Model or implementation: Llama-2-70B (Flame)

Novel Architectural Elements

Hybrid SFT data pipeline: Routes fact-based prompts to self-generated data (internal knowledge) and non-fact prompts to human data (instruction following)
Two-signal DPO: Combines standard instruction-following preference pairs with distinct factuality-driven preference pairs derived from atomic fact verification
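The hybrid SFT routing can be illustrated with a short sketch. The function and field names are hypothetical, and the classifier and few-shot generator are passed in as plain callables so that toy stand-ins can replace the real models.

```python
from typing import Callable, Dict

def build_sft_example(
    instruction: str,
    human_response: str,
    is_fact_based: Callable[[str], bool],
    generate_few_shot: Callable[[str], str],
) -> Dict[str, str]:
    """Pick the SFT target: self-generated for fact-based prompts, human otherwise."""
    if is_fact_based(instruction):
        # Fact-based: train on what the pre-trained model itself produces.
        return {
            "prompt": instruction,
            "response": generate_few_shot(instruction),
            "source": "self",
        }
    # Non-fact-based: keep the human-written response.
    return {"prompt": instruction, "response": human_response, "source": "human"}

def toy_is_fact_based(instruction: str) -> bool:
    """Toy classifier (assumption): 'who ...' questions count as fact-based."""
    return instruction.lower().startswith("who")

def toy_generate(instruction: str) -> str:
    """Toy stand-in for few-shot generation from the pre-trained model."""
    return "[self-generated answer]"

fact_example = build_sft_example(
    "Who was Ada Lovelace?", "human bio", toy_is_fact_based, toy_generate
)
other_example = build_sft_example(
    "Write a haiku about rain.", "human haiku", toy_is_fact_based, toy_generate
)
```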

Modeling

Base Model: Llama-2-70B

Training Method: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer factual responses over hallucinated ones.

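Formally: DPO loss L_DPO = -E[log sigma(beta * log(pi(y_w|x)/pi_ref(y_w|x)) - beta * log(pi(y_l|x)/pi_ref(y_l|x)))], where y_w is the preferred response and y_l the rejected response for instruction x, pi is the policy being trained, pi_ref is the frozen reference model (typically the SFT model), sigma is the logistic sigmoid, and beta controls how far the policy may drift from the reference.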
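For a single preference pair, the loss above can be computed directly from sequence log-probabilities. The sketch below uses only the standard library. The default `beta=0.1` is an assumption for illustration, since the summary notes that the paper does not report it.

```python
import math

def dpo_loss(
    logp_w: float,      # log pi(y_w | x) under the policy
    ref_logp_w: float,  # log pi_ref(y_w | x) under the reference model
    logp_l: float,      # log pi(y_l | x) under the policy
    ref_logp_l: float,  # log pi_ref(y_l | x) under the reference model
    beta: float = 0.1,  # assumed value; not reported in the paper
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -12.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-8.0, -10.0, -14.0, -12.0)
```

In training, the expectation in the formula is approximated by averaging this per-pair value over a batch of preference pairs.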

Training Data:

Seed data: 3,200 instructions from Open Assistant (OASST)
SFT (Flame): Human responses for non-fact instructions; Self-generated (5-shot) responses for fact instructions
DPO (Flame): Standard instruction-following pairs + Factuality pairs (chosen via retrieval-based fact score)
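One plausible way to turn fact scores into factuality pairs is sketched below: score several sampled responses to the same instruction, then take the highest-scoring one as chosen and the lowest-scoring one as rejected. The max/min selection rule, the `min_gap` filter, and all names are assumptions for illustration. The summary only states that pairs are chosen via a retrieval-based fact score.

```python
from typing import Callable, Dict, List, Optional

def build_factuality_pair(
    instruction: str,
    responses: List[str],
    fact_score: Callable[[str], float],
    min_gap: float = 0.0,  # hypothetical filter: skip pairs with no score gap
) -> Optional[Dict[str, str]]:
    """Return a (chosen, rejected) pair, or None if the scores do not separate."""
    if len(responses) < 2:
        return None
    scored = [(fact_score(r), r) for r in responses]
    best_score, chosen = max(scored, key=lambda t: t[0])
    worst_score, rejected = min(scored, key=lambda t: t[0])
    if best_score - worst_score <= min_gap:
        return None
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

# Toy fact scores standing in for the retrieval-based reward model.
toy_scores = {"resp A": 0.9, "resp B": 0.4, "resp C": 0.6}

pair = build_factuality_pair(
    "Tell me about X.", list(toy_scores), lambda r: toy_scores[r]
)
tied = build_factuality_pair("Tell me about X.", ["a", "b"], lambda r: 0.5)
```

Dropping tied pairs avoids feeding DPO a preference with no factuality signal behind it.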

Key Hyperparameters:

dpo_beta: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Fine-tuning on RAG: Flame avoids RAG data for SFT to prevent 'forcing' unknown knowledge, finding that RAG data actually increases hallucinations in standard models
vs. Standard RLHF: Flame explicitly separates factuality rewards from helpfulness rewards, creating specific preference pairs for factuality rather than mixing them into one scalar
vs. Tian et al. (2024): Tian et al. focus only on factuality DPO; Flame addresses the full pipeline (SFT + DPO) and ensures instruction following is maintained [cited in paper]
vs. Kang et al. (2024): Kang et al. teach models to abstain/hedge on unfamiliar queries; Flame teaches models to output only known facts via self-generation [cited in paper]

Limitations

Relies on the accuracy of the automated factuality reward model (RM_fact); if the verifier is wrong, the signal is noisy.
Requires a high-quality pre-trained model (Llama-2-70B) to generate its own valid training data; may not work for weaker base models.
The distinction between 'fact-based' and 'non-fact-based' is binary and dependent on a classifier prompt, which may lack nuance.

Reproducibility

No replication artifacts mentioned in the paper (no code URL, no released weights). Paper relies on proprietary/internal implementation of specific retrieval tools (DRAGON+) and evaluators.

📊 Experiments & Results

Evaluation Setup

Evaluation on two distinct axes: Factuality (Biography generation) and Instruction Following (Alpaca Eval)

Benchmarks:

Biography (factuality evaluation, long-form generation)
Alpaca Eval (General instruction following)

Metrics:

FActScore (factuality)
Win Rate (instruction following vs baseline)
Statistical methodology: Not explicitly reported in the paper

Key Results

Pilot study on Biography generation: training on RAG data hurts factuality, while training on self-generated data improves it.
Benchmark	Metric	Baseline	This Paper	Δ
Biography (SFT on RAG data)	FActScore	47.7	46.1	-1.6
Biography (SFT on self-generated data)	FActScore	47.7	56.4	+8.7

Main results: Flame compared to standard alignment (SFT+DPO) on both factuality and helpfulness.
Benchmark	Metric	Baseline	This Paper	Δ
Biography	FActScore	49.1	54.7	+5.6
Alpaca Eval	Win Rate (%)	50.0	51.2	+1.2
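The Δ column is simply the reported score minus the baseline, which can be checked with a few lines. The row labels here are descriptive shorthand inferred from the table captions, not names used in the paper.

```python
# (label, baseline, reported score) taken from the table above.
rows = [
    ("Biography pilot, SFT on RAG data", 47.7, 46.1),
    ("Biography pilot, SFT on self-generated data", 47.7, 56.4),
    ("Biography main, Flame vs. SFT+DPO", 49.1, 54.7),
    ("Alpaca Eval main, Flame vs. SFT+DPO", 50.0, 51.2),
]

# Round to one decimal to absorb floating-point noise in the subtraction.
deltas = [round(score - baseline, 1) for _, baseline, score in rows]

for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.1f}")
```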

Main Takeaways

Fine-tuning on 'better' data (RAG outputs) harms factuality if the model doesn't possess that knowledge internally (Pilot Study).
Self-generated training data is crucial for alignment because it respects the model's knowledge boundaries.
Flame successfully decouples factuality alignment from instruction-following alignment, allowing improvements in one without degrading the other.
Standard RL rewards (helpfulness/length) negatively correlate with factuality, encouraging hallucination (Figure 1 in paper).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in LLMs
Familiarity with Direct Preference Optimization (DPO)
Knowledge of RAG (Retrieval-Augmented Generation)

Key Terms

FActScore: A metric that decomposes long-form generations into atomic facts and verifies each against a knowledge base (like Wikipedia) to measure factuality

DPO: Direct Preference Optimization, a method for aligning language models to preferences without training a separate reward model, using a classification-style loss on preference pairs

RAG: Retrieval-Augmented Generation, which augments LLM input with retrieved documents to improve factual accuracy

SFT: Supervised Fine-Tuning, which trains a model on labeled input-output pairs

atomic fact decomposition: Breaking down a complex sentence into individual, verifiable statements

self-rewarding: Using the LLM itself to evaluate the quality of its own or others' outputs during training

RLHF: Reinforcement Learning from Human Feedback, which aligns models using rewards derived from human preferences

RLAIF: Reinforcement Learning from AI Feedback, which is similar to RLHF but uses AI models to generate the feedback/preferences