FenCE: Improving Model Factuality with a Fine-Grained Critic-Based Evaluator

📝 Paper Summary

Hallucination suppression | Confidence-based fine-tuning | RLHF for Factuality

Fine-tuning language models using Direct Preference Optimization on automatically generated preference pairs—based on either external knowledge retrieval or the model's own confidence—significantly reduces factual errors in long-form generation.

Core Problem

Large language models frequently generate convincing but incorrect claims (hallucinations), and standard pre-training objectives do not sufficiently penalize these errors.

Why it matters:

Manual fact-checking is expensive and slow (e.g., 9 minutes per biography), making human-labeled preference datasets costly to acquire
Maximum likelihood pre-training encourages 'smearing' probability mass over many possible answers, leading to hallucinations when the model is uncertain or underfits
Existing RLHF methods focus on helpfulness/harmfulness but don't explicitly target factual correctness, sometimes exacerbating hallucinations

Concrete Example: When asked 'Where was Yo-Yo Ma born?', a standard model that lacks the specific fact might confidently name the wrong city (the correct answer is Paris) instead of signaling uncertainty. A factual model should recognize its internal uncertainty and avoid the claim, but standard training does not distinguish between 'confidently wrong' and 'cautiously vague'.

Key Novelty

Automated Factuality Preference Tuning (FactTune)

Construct preference datasets automatically by sampling two responses from the model and ranking them based on estimated truthfulness (either via external retrieval or internal confidence)
Use Direct Preference Optimization (DPO) to fine-tune the model on these ranked pairs, teaching it to prefer more factual generation styles without needing human labels
Introduce a 'reference-free' estimation method that uses the model's own confidence (consistency across resampled answers) as a proxy for truthfulness, eliminating the need for external sources such as Wikipedia or Google

Architecture

The complete pipeline for Factuality Tuning, from sampling to scoring to DPO updates.

Evaluation Highlights

Reduces factual error rate by 58% on biography generation compared to Llama-2-chat (7B scale)
Reduces factual error rate by 40% on medical question answering compared to Llama-2-chat
FactTune-FS (reference-based) achieves 89.5% factual accuracy on biographies, outperforming RLHF (74.8%) and inference-time intervention baselines

Breakthrough Assessment

8/10

Significant because it demonstrates that costly human labeling isn't necessary for alignment on factuality. The reference-free approach is particularly promising for domains where ground truth is scarce.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a language model policy πθ to maximize factuality of generated text y given input x, without human labels

Inputs: Unlabeled prompts x (e.g., 'Write a biography of Mary Wollstonecraft')

Outputs: Long-form text response y with minimized factual errors

Pipeline Flow

Prompt Sampling (x from dataset)
Response Sampling (generate pairs y_1, y_2)
Truthfulness Estimation (score y_1, y_2 via FactScore or Model Confidence)
Preference Pair Construction (create DPO dataset)
DPO Fine-tuning (update πθ)
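
The pair-construction step in the flow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `build_preference_pairs`, the `prompt`/`chosen`/`rejected` field names, the use of all pairwise combinations, and the choice to skip tied scores are all assumptions. The truthfulness scores are taken as given (from either the reference-based or the reference-free estimator).

```python
from itertools import combinations


def build_preference_pairs(prompt, responses, scores):
    """Turn scored samples for one prompt into DPO-style preference records.

    `scores[i]` is the estimated truthfulness of `responses[i]`
    (higher means more factual). Tied pairs carry no preference
    signal, so they are skipped (an assumption of this sketch).
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        if scores[i] == scores[j]:
            continue
        winner, loser = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append(
            {
                "prompt": prompt,
                "chosen": responses[winner],
                "rejected": responses[loser],
            }
        )
    return pairs


# Hypothetical responses and scores, for illustration only.
example_pairs = build_preference_pairs(
    "Write a biography of Mary Wollstonecraft",
    ["response A", "response B", "response C"],
    [0.9, 0.6, 0.9],
)
```

Here responses A and C tie, so only the (A, B) and (C, B) comparisons yield records, each with B as the rejected response.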

System Modules

Generator

Generate candidate responses for a given prompt

Model or implementation: Llama-2-7b or Llama-1-7b

Claim Extractor (Scoring / Preference Creation)

Break down long-form text into atomic factual claims

Model or implementation: GPT-3.5

Truthfulness Estimator (Ref-Based) (Scoring / Preference Creation)

Verify claims against external knowledge (Wikipedia)

Model or implementation: Fine-tuned Llama-1-7B (FactScore checker)

Truthfulness Estimator (Ref-Free) (Scoring / Preference Creation)

Estimate truthfulness via model's own confidence consistency

Model or implementation: Llama-1-7B (same as base model)
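
One way to picture the reference-free estimator is as an agreement rate over resampled answers. The sketch below assumes that each atomic claim has already been turned into a question and that several answers have been sampled from the model. Matching answers by normalized string equality and averaging per-claim confidences into a response score are simplifying assumptions of this sketch, and the helper names are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude equivalence key: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def claim_confidence(sampled_answers):
    """Fraction of samples that agree with the most common answer."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in sampled_answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(sampled_answers)


def response_score(per_claim_samples):
    """Average confidence over all claims in one response."""
    confidences = [claim_confidence(s) for s in per_claim_samples]
    return sum(confidences) / len(confidences)


# Hypothetical samples for two claims extracted from one response.
score = response_score(
    [
        ["1955", "1955", " 1955", "1956"],      # 3 of 4 samples agree
        ["cello", "Cello", "cello", "cello"],   # all samples agree
    ]
)
```

A response whose claims the model answers consistently scores close to 1, while a response built on claims the model answers inconsistently scores lower and would be ranked as less factual.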

Novel Architectural Elements

Self-supervised alignment loop where the model's own confidence (or automated external checks) generates the training signal for DPO, replacing human annotators

Modeling

Base Model: Llama-2-7b and Llama-1-7b

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to prefer factual responses over hallucinated ones.

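Formally: L_DPO(πθ; πref) = -E[log σ(β * log(πθ(yw|x)/πref(yw|x)) - β * log(πθ(yl|x)/πref(yl|x)))], where yw is the preferred (more factual) response, yl is the dispreferred response, σ is the logistic sigmoid, πref is the frozen reference model, and β controls how strongly the policy πθ is kept close to πref.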
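
To make the objective concrete, here is a minimal sketch of the loss for a single preference pair, written in plain Python rather than a training framework. The function names and the log-probability values are hypothetical, and `beta=0.1` is only a placeholder because the value used in the paper is not reported here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, yw, yl) triple.

    Each argument is the summed token log-probability of a full response
    under the policy or the frozen reference model. beta=0.1 is a
    placeholder value, not one taken from the paper.
    """
    log_ratio_w = policy_logp_w - ref_logp_w  # log(pi_theta(yw|x) / pi_ref(yw|x))
    log_ratio_l = policy_logp_l - ref_logp_l  # log(pi_theta(yl|x) / pi_ref(yl|x))
    margin = beta * (log_ratio_w - log_ratio_l)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward yw and away from yl.
loss_after_shift = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy equals the reference model the loss is log 2, and it decreases as the policy raises the relative likelihood of the preferred response.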

Training Data:

Biographies: 355 individuals (296 train, 59 test)
Medical QA: 200 conditions (150 train, 50 test)
Sampled multiple responses per prompt (10 for Bio, 6 for MedQA) to form pairs

Key Hyperparameters:

beta: Not explicitly reported in the paper
temperature: 1.0 (for sampling candidates)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RLHF: FactTune specifically targets factual accuracy via automated labels, whereas standard RLHF targets general helpfulness/safety and can degrade factuality
vs. ITI/DOLA: FactTune optimizes the model weights directly via fine-tuning rather than modifying inference-time behavior; can be combined with DOLA for additive gains
vs. RARR [not cited in paper]: RARR edits outputs after generation using retrieval; FactTune optimizes the model to generate correct outputs initially

Limitations

Reference-based scoring requires a reliable knowledge base (Wikipedia) and retrieval system
Reference-free scoring relies on the model being well-calibrated; if the model is 'confidently wrong', the signal fails
Current experiments are limited to 7B-scale models
GPT-3.5 used for claim extraction adds cost and potential noise to the pipeline

Reproducibility

Not provided. The paper mentions using GPT-3.5 for claim extraction and FactScore for evaluation, but does not provide a repository link for the specific fine-tuning scripts or the generated preference datasets.

📊 Experiments & Results

Evaluation Setup

Long-form generation of biographies and answers to medical questions

Benchmarks:

Biographies: Long-form text generation (writing bios for valid Wikipedia entities) [New]
Medical QA: Open-ended question answering (symptoms/treatments) [New]

Metrics:

Number of Correct Facts (FactScore)
Number of Incorrect Facts (FactScore)
Percent Correct (Precision)
Human Evaluation (Accuracy)
GPT-4 Evaluation (Error Count)
Statistical methodology: Not explicitly reported in the paper
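
The three FactScore-derived metrics above are related by simple arithmetic over per-claim verdicts. The sketch below assumes each response has already been split into atomic claims and each claim labeled true or false; the function name and the example verdicts are hypothetical.

```python
def factuality_metrics(verdicts):
    """Summarize per-claim verdicts (True = supported) for one response."""
    correct = sum(1 for v in verdicts if v)
    incorrect = len(verdicts) - correct
    # Percent Correct is precision over extracted claims; report 0.0
    # for a response with no claims rather than dividing by zero.
    percent_correct = correct / len(verdicts) if verdicts else 0.0
    return {
        "correct": correct,
        "incorrect": incorrect,
        "percent_correct": percent_correct,
    }


# Hypothetical verdicts for a response with four extracted claims.
metrics = factuality_metrics([True, True, True, False])
```

Because Percent Correct is a ratio, a model can raise it either by stating more correct facts or by stating fewer claims overall, which is why the counts of correct and incorrect facts are listed alongside it.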

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison against Llama-2-Chat baseline shows significant reduction in factual errors.
Biographies	Number of Incorrect Facts	6.41	4.06	-2.35
Biographies	Percent Correct	0.748	0.831	+0.083
Medical QA	Number of Incorrect Facts	5.50	3.47	-2.03
Comparison against Inference-Time baselines (ITI, DOLA) on Llama-1-7B.
Biographies	Percent Correct	0.754	0.812	+0.058
Medical QA	Percent Correct	0.633	0.707	+0.074
Reference-free FactTune-MC (Model Confidence) results.
Biographies	Percent Correct	0.568	0.783	+0.215

Experiment Figures

Scatter plot of Correct Facts vs. Incorrect Facts per response for various methods.

Correlation between FactScore-counted errors and GPT-4-counted errors.

Main Takeaways

FactTune-FS (Reference-based) consistently outperforms RLHF and decoding strategies (ITI, DOLA) in reducing factual errors.
FactTune-MC (Reference-free) effectively improves factuality using only the model's own confidence, providing a viable path for domains without external ground truth.
Factuality tuning can be combined with inference-time decoding strategies like DOLA for additive improvements.
Qualitatively, fact-tuned models adopt a more objective, less conversational style (simpler sentences, fewer casual phrases).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Language Model Calibration/Uncertainty estimation

Key Terms

DPO: Direct Preference Optimization—a stable method to fine-tune LMs on preference pairs by optimizing a classification loss, avoiding explicit reward modeling

RLHF: Reinforcement Learning from Human Feedback—training models using rewards derived from human preferences

FactScore: An automated evaluation metric that breaks text into atomic claims and verifies each against Wikipedia using a retrieval system

atomic claims: The smallest indivisible statements of fact within a longer text (e.g., 'Yo-Yo Ma plays cello' is atomic; 'Yo-Yo Ma is a French-born cellist' contains two atomic claims)

calibration: The property that a model's predicted confidence matches its empirical accuracy (e.g., claims made with 80% confidence are correct about 80% of the time)

FactTune-FS: The paper's method using FactScore (reference-based) to generate preference labels

FactTune-MC: The paper's method using Model Confidence (reference-free) to generate preference labels

SFT: Supervised Fine-Tuning—standard training on high-quality demonstration data

ITI: Inference-Time Intervention—a technique that shifts model activations during inference to improve truthfulness

DOLA: Decoding by Contrasting Layers—a decoding strategy that amplifies factual knowledge by contrasting outputs from different model layers

semantic entropy: A measure of uncertainty that clusters generated answers by meaning rather than exact token match

Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another based on their underlying reward scores
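
As a small illustration of the Bradley-Terry model, the preference probability is the logistic sigmoid of the difference between two reward scores. The function name and the reward values below are hypothetical.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = sigmoid(reward_a - reward_b)."""
    diff = reward_a - reward_b
    # Branch on the sign so math.exp never receives a large positive value.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


p_equal = preference_probability(1.0, 1.0)
p_a_over_b = preference_probability(2.0, 0.5)
p_b_over_a = preference_probability(0.5, 2.0)
```

Equal rewards give a probability of 0.5, and the two directions of any comparison sum to 1. DPO uses this same form, with the reward expressed implicitly through the policy-to-reference log-probability ratio.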