Learning to Reason for Factuality

📝 Paper Summary

Long-form factuality
Reasoning Large Language Models (R-LLMs)
Reinforcement Learning for Factuality

The paper proposes an online Reinforcement Learning framework with a multi-component reward function (precision, detail, relevance) to train Reasoning LLMs to generate factual, detailed, and helpful long-form responses.

Core Problem

Current Reasoning LLMs (R-LLMs) like DeepSeek-R1 hallucinate significantly more than non-reasoning models on long-form factuality tasks, and existing offline RL methods for factuality lead to reward hacking (e.g., extremely short responses).

Why it matters:

R-LLMs are increasingly entrusted with high-stakes tasks where factuality is critical, yet they currently hallucinate 10-13 percentage points more than their non-reasoning counterparts
Existing automatic factuality evaluations (e.g., FActScore) are too slow for online RL loops and lack reliable recall metrics, causing models to optimize for precision by generating mostly empty answers
Offline RL (DPO) on factuality data degrades general response quality and helpfulness

Concrete Example: When optimizing solely for factual precision, a model might answer 'Who is Leon Wildes?' with a generic, safe response about immigration law that is factually true but irrelevant to the specific person, effectively hacking the metric.

Key Novelty

Online RL for Factual Reasoning (SFT + GRPO)

Introduces a composite reward function balancing three competing objectives: factual precision (correctness), detail level (quantity of facts), and answer relevance (helpfulness)
Implements a high-speed, parallelized version of the VeriScore evaluation metric to enable real-time reward calculation during online RL training loops
Applies Group Relative Policy Optimization (GRPO) to long-form factuality, shifting from offline preference optimization (DPO) to on-policy learning

Evaluation Highlights

Achieves 68.1% average factual precision across six benchmarks, a +23.1 point improvement over the Llama-3.1-8B-Instruct base model
Increases response detail level by 23% (more factual claims generated) compared to the base model, avoiding the brevity penalty common in prior work
Maintains >50% win rate against the base model on helpfulness evaluations, unlike offline DPO baselines which degraded to ~37% win rate

Breakthrough Assessment

8/10

Significant because it successfully applies online RL (GRPO)—usually reserved for math/code—to open-ended factuality by solving the reward latency and hacking problems. Demonstrates strong empirical gains over both base models and offline RL.

⚙️ Technical Details

Problem Definition

Setting: Open-ended long-form question answering requiring factual accuracy

Inputs: Fact-seeking prompt x

Outputs: Reasoning chain y_cot and final answer y_ans

Pipeline Flow

Prompt Generation (Llama 4 generates synthetic prompts)
SFT (Base model fine-tuned on seed factual reasoning data)
Online RL Loop (Generate Group -> Compute Rewards -> GRPO Update)
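
The online RL stage above can be sketched as a single loop iteration. This is a toy illustration only: `generate`, `score`, and `update` are hypothetical placeholder callables standing in for the policy sampler, the reward pipeline, and the GRPO step. None of these names come from the paper's code.

```python
import random
from typing import Callable, List

def online_rl_step(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the policy
    score: Callable[[str, str], float],  # hypothetical: composite reward for (prompt, response)
    update: Callable[[str, List[str], List[float]], None],  # hypothetical: GRPO update
    group_size: int = 4,
) -> List[float]:
    """One iteration: generate a group -> compute rewards -> policy update."""
    group = [generate(prompt) for _ in range(group_size)]
    rewards = [score(prompt, y) for y in group]
    update(prompt, group, rewards)
    return rewards

# Toy stand-ins so the sketch runs end to end.
rng = random.Random(0)
log = []
rewards = online_rl_step(
    "Who is Leon Wildes?",
    generate=lambda x: f"answer-{rng.randint(0, 9)}",
    score=lambda x, y: float(len(y)),
    update=lambda x, ys, rs: log.append((x, len(ys), len(rs))),
)
```

The rewards are computed on responses sampled from the current policy, which is what makes the loop on-policy rather than offline.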

System Modules

Policy Model

Generates the reasoning trace and final answer

Model or implementation: Llama-3.1-8B-Instruct (initialized via SFT)

VeriScore (Optimized) (Reward Calculation)

Calculates factual precision and detail rewards

Model or implementation: Llama-3.3-70B-Instruct (as claim extractor and verifier)
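
A minimal sketch of how a claim-level evaluator of this kind produces the two counts a precision and detail reward needs: supported claims and total claims. The extractor and verifier below are toy stand-ins (sentence splitting and a fixed lookup set), not VeriScore's LLM-based extraction or search-based verification, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

def count_claims(
    answer: str,
    extract: Callable[[str], List[str]],  # hypothetical claim extractor
    is_supported: Callable[[str], bool],  # hypothetical claim verifier
) -> Tuple[int, int]:
    """Return (supported, total) claim counts for one answer."""
    claims = extract(answer)
    supported = sum(1 for c in claims if is_supported(c))
    return supported, len(claims)

# Toy stand-ins: one claim per sentence, verified against a fixed set.
def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]

KNOWN_FACTS = {"Paris is in France", "Water boils at 100 C at sea level"}

supported, total = count_claims(
    "Paris is in France. Paris is in Spain. Water boils at 100 C at sea level.",
    extract=split_sentences,
    is_supported=lambda c: c in KNOWN_FACTS,
)
```

In the real system both stages are model calls, which is why their latency dominates the cost of computing a reward.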

Relevance Judge (Reward Calculation)

Checks if the answer is relevant and helpful compared to a reference response

Model or implementation: LLM-as-a-judge

Novel Architectural Elements

Three-component reward function aggregating Factual Precision, Log-scaled Detail, and Relevance into a single scalar
Integration of a latency-optimized retrieval-based factuality evaluator (VeriScore) directly into the online RL inner loop
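
A sketch of the three-component reward as a single scalar, following the formula given under Objective Functions in this summary. It assumes F is the number of supported claims and T the total number of extracted claims; the `lam` and `mu` defaults are illustrative values only, since this summary does not report the weights used.

```python
import math

def composite_reward(
    supported: int,         # F: claims judged supported (assumed meaning)
    total: int,             # T: all extracted claims (assumed meaning)
    beats_reference: bool,  # judge verdict that the answer beats the reference
    lam: float = 0.1,       # illustrative value only; not from the paper
    mu: float = 0.1,        # illustrative value only; not from the paper
) -> float:
    precision = supported / (total + 1)  # F / (T + 1)
    detail = math.log(1 + supported)     # log-scaled count of supported claims
    relevance = 1.0 if beats_reference else 0.0
    return precision + lam * detail + mu * relevance

# A near-empty answer with one true claim vs. a detailed, mostly correct one.
terse = composite_reward(supported=1, total=1, beats_reference=False)
detailed = composite_reward(supported=18, total=20, beats_reference=True)
```

With these illustrative weights the detailed answer scores higher than the terse one, which is the behavior the detail and relevance terms are meant to encourage. The `T + 1` denominator also means an answer with no claims scores 0 on precision instead of dividing by zero.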

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize composite reward encompassing factuality, detail, and relevance.

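Formally: R(y|x) = F/(T+1) + lambda * log(1+F) + mu * 1_{y > y_ref}, where F is the number of supported claims, T is the total number of extracted claims, and 1_{y > y_ref} indicates that the judge prefers the answer to the reference response

Purpose: Optimize policy based on relative advantage within a group.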

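Formally: minimize -E[(1/G) * sum_i min(r_i * A_i, clip(r_i, 1 - eps, 1 + eps) * A_i)], where G is the group size, r_i is the new-to-old policy probability ratio, and A_i is the reward of response i normalized by the mean and standard deviation of its group (the clipped surrogate of the standard GRPO objective)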
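
A minimal sketch of the two ingredients of the GRPO objective: group-normalized advantages and the clipped surrogate term for one sample. It uses the population standard deviation and a clip range of 0.2, both of which are assumptions for illustration rather than settings reported in this summary.

```python
import statistics
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate for one sample: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four responses to the same prompt, scored by the composite reward.
adv = group_advantages([0.2, 0.5, 0.8, 1.1])
```

Because advantages are centered within the group, responses are rewarded only for being better than their siblings on the same prompt, so no separate value network is needed.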

Adaptation: Full fine-tuning

Training Data:

7k synthetic prompts generated by Llama 4 using WildChat and LongFact seeds
3k prompts for SFT split, 4k prompts for RL split

Key Hyperparameters:

lambda: Controls weight of detail reward
mu: Controls weight of relevance/quality reward
tau_m: 0.1 (DPO margin threshold)
tau_l: 0.1 (DPO length difference threshold)

Compute: The VeriScore implementation runs parallelized Llama-3.3-70B-Instruct workers via vLLM/Matrix; verification takes about 5 seconds per response, versus about 2 minutes for the original implementation
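
The idea of scoring a whole group of responses concurrently can be sketched with a thread pool. This is a stand-in under stated assumptions, not the authors' vLLM/Matrix setup: `score_one` is a hypothetical callable representing one full verification call, and `slow_scorer` simulates its latency with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def score_group(
    responses: List[str],
    score_one: Callable[[str], float],  # hypothetical: one full verification call
    max_workers: int = 8,
) -> List[float]:
    """Score all responses in a group concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, responses))

def slow_scorer(response: str) -> float:
    time.sleep(0.05)  # stands in for network-bound LLM and search calls
    return float(len(response.split()))

scores = score_group(["a b c", "a b", "a b c d"], slow_scorer)
```

Threads suit this kind of workload because the time is spent waiting on remote model and search calls rather than on local computation.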

Comparison to Prior Work

vs. DeepSeek-R1: Directly targets factuality via online RL rather than just math/code reasoning
vs. FLAME: Uses online RL (GRPO) instead of offline DPO; uses generated prompts rather than filtered OpenAssistant data
vs. Standard RLHF: Optimizes towards a complex, retrieval-augmented factuality metric rather than just a preference model
vs. FActScore/SAFE [evaluation]: Uses a re-implemented VeriScore that is about 30x faster, making it suitable for training loops

Limitations

Relies on automatic evaluation (VeriScore) as a ground-truth reward, which may have its own errors
Computational cost of retrieval-based reward calculation is high compared to standard preference models
Training prompts are synthetically generated, so there may be a distribution shift from real user queries
Requires a reference model for the relevance component of the reward

Reproducibility

Code: https://github.com/facebookresearch/factual-reasoning

Code availability is listed as 'not yet released'; the text references https://github.com/facebookresearch/factual-reasoning, which is likely a placeholder or internal link. Data generation uses proprietary models (Llama 4). Retrieval relies on Google Search via the Serper API, which is a paid external dependency.

📊 Experiments & Results

Evaluation Setup

Long-form generation evaluated on factual precision, detail level (number of facts), and helpfulness

Benchmarks:

LongFact-Objects (Long-form factual generation)
FAVA (Fine-grained hallucination detection)
AlpacaFact (Fact-seeking instruction following)
Biography (Biographical generation)
FactBench-Hard (Challenging factual prompts)
Factory-Hard (Difficult prompts where frontier LLMs fail)

Metrics:

Factual Precision (Prec.)
Detail Level (Dtl., number of supported claims)
Win Rate (WR, AlpacaEval style vs base model)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Benchmarking existing R-LLMs reveals they hallucinate significantly more than their non-reasoning counterparts.
Average across 6 datasets	Factual Precision	62.9	54.7	-8.2
Average across 6 datasets	Factual Precision	51.4	38.3	-13.1
Main comparison shows the proposed Online RL method (SFT+GRPO) outperforms offline baselines in precision without sacrificing detail or helpfulness.
Average across 6 datasets	Factual Precision	45.0	68.1	+23.1
Average across 6 datasets	Detail Level (# claims)	23.5	29.0	+5.5
Average across 6 datasets	Win Rate (vs Base)	50.0	54.4	+4.4
Average across 6 datasets	Win Rate (vs Base)	37.8	54.4	+16.6
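
As a quick arithmetic check on the main comparison rows, the sketch below recomputes each Δ from the baseline and reported values, and also expresses the detail gain as a relative change. The numbers are copied from the table; the helper names are hypothetical.

```python
rows = [
    ("Factual Precision", 45.0, 68.1),
    ("Detail Level (# claims)", 23.5, 29.0),
    ("Win Rate (vs Base)", 50.0, 54.4),
    ("Win Rate (vs Base)", 37.8, 54.4),
]

def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to one decimal like the table."""
    return round(ours - baseline, 1)

def relative_change_pct(baseline: float, ours: float) -> float:
    """Relative change as a percentage of the baseline."""
    return round(100.0 * (ours - baseline) / baseline, 1)

deltas = [delta(b, o) for _, b, o in rows]
detail_gain_pct = relative_change_pct(23.5, 29.0)
```

The recomputed deltas match the Δ column, and the relative detail gain of about 23% matches the figure quoted in the Evaluation Highlights.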

Main Takeaways

Current Reasoning LLMs (DeepSeek-R1, QwQ) trade off factuality for reasoning capabilities, showing higher hallucination rates than base models.
Offline RL (DPO) can improve precision but suffers from 'reward hacking' via length reduction, leading to terse, unhelpful responses (low Win Rate).
The proposed 3-part reward (Precision, Detail, Relevance) combined with online GRPO successfully improves factuality and detail simultaneously.
Optimized VeriScore enables calculating expensive retrieval-based rewards within the training loop (30x speedup).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (on-policy vs. offline, reward hacking)
Chain-of-Thought (CoT) prompting
LLM evaluation metrics (FActScore, VeriScore)

Key Terms

R-LLMs: Reasoning Large Language Models—models that generate a long chain-of-thought 'thinking process' before the final answer (e.g., OpenAI o1, DeepSeek-R1)

GRPO: Group Relative Policy Optimization—an online RL algorithm that normalizes rewards within a sampled group of outputs for the same prompt to reduce variance

VeriScore: An automatic evaluation framework for long-form factuality that extracts atomic claims and verifies them using search engine results

DPO: Direct Preference Optimization—an offline method aligning models to preferences without an explicit reward model loop

RLHF: Reinforcement Learning from Human Feedback—aligning models using human preference data

LLM-as-a-judge: Using a strong LLM (like GPT-4) to evaluate the quality of text generated by another model

SFT: Supervised Fine-Tuning—training the model on high-quality examples before applying RL

reward hacking: When an RL agent exploits flaws in the reward function to get a high score without actually solving the task (e.g., generating very short answers to maximize precision)

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps

atomic claim: A single, verifiable fact extracted from a longer sentence or paragraph