
Learning to Reason for Factuality

X Chen, I Kulikov, VP Berges, B Oğuz, R Shao…
Fundamental AI Research at Meta, University of Washington
arXiv, 8/2025 (2025)
Factuality, Reasoning, RL, Benchmark

📝 Paper Summary

Long-form factuality, Reasoning Large Language Models (R-LLMs), Reinforcement Learning for Factuality
The paper proposes an online Reinforcement Learning framework with a multi-component reward function (precision, detail, relevance) to train Reasoning LLMs to generate factual, detailed, and helpful long-form responses.
Core Problem
Current Reasoning LLMs (R-LLMs) such as DeepSeek-R1 hallucinate significantly more than non-reasoning models on long-form factuality tasks, and existing offline RL methods for factuality invite reward hacking (e.g., extremely short responses).
Why it matters:
  • R-LLMs are increasingly entrusted with high-stakes tasks where factuality is critical, yet their hallucination rates are currently 10-13 percentage points higher than those of their non-reasoning counterparts
  • Existing automatic factuality evaluations (e.g., FActScore) are too slow for online RL loops and lack reliable recall metrics, causing models to optimize for precision by generating mostly empty answers
  • Offline RL (DPO) on factuality data degrades general response quality and helpfulness
Concrete Example: When optimizing solely for factual precision, a model might answer 'Who is Leon Wildes?' with a generic, safe response about immigration law that is factually true but irrelevant to the specific person, effectively hacking the metric.
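The failure mode above can be illustrated with a minimal sketch. It assumes factual precision is simply the fraction of extracted claims judged supported; the function name and the claim counts are hypothetical and are not taken from the paper.

```python
def factual_precision(supported: int, total: int) -> float:
    """Fraction of extracted claims judged supported (0.0 if there are no claims)."""
    return supported / total if total else 0.0


# Hypothetical counts: a terse, generic answer vs. a detailed, on-topic one.
short_generic = factual_precision(supported=2, total=2)
long_specific = factual_precision(supported=18, total=24)

# A precision-only reward prefers the terse answer even though it conveys
# far fewer supported facts about the person being asked about.
print(short_generic, long_specific)
```

Under these assumed counts the terse answer scores higher on precision despite carrying 2 supported claims against 18, which is why a reward built on precision alone tends to favor short or evasive responses.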
Key Novelty
Online RL for Factual Reasoning (SFT + GRPO)
  • Introduces a composite reward function balancing three competing objectives: factual precision (correctness), detail level (quantity of facts), and answer relevance (helpfulness)
  • Implements a high-speed, parallelized version of the VeriScore evaluation metric to enable real-time reward calculation during online RL training loops
  • Applies Group Relative Policy Optimization (GRPO) to long-form factuality, shifting from offline preference optimization (DPO) to on-policy learning
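The sketch below shows, under stated assumptions, how a three-part reward and GRPO's group-relative advantage might fit together. The functional form of the reward (precision plus a log-scaled claim count plus a binary relevance term), the weights `w_detail` and `w_relevance`, and all claim counts are illustrative assumptions, not the paper's exact formulation. The advantage follows the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); using the population standard deviation is likewise an assumption.

```python
import math
from statistics import mean, pstdev


def composite_reward(supported: int, total: int, relevant: bool,
                     w_detail: float = 0.1, w_relevance: float = 0.5) -> float:
    """Hypothetical reward combining precision, detail, and relevance."""
    precision = supported / total if total else 0.0
    detail = math.log1p(supported)        # grows with the number of supported claims
    relevance = 1.0 if relevant else 0.0  # e.g. a judge's verdict on answer relevance
    return precision + w_detail * detail + w_relevance * relevance


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses sampled for one prompt:
# (supported claims, total claims, judged relevant?)
group = [(2, 2, False), (18, 24, True), (10, 20, True)]
rewards = [composite_reward(s, t, rel) for s, t, rel in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

With these assumed weights, the detailed and relevant second response receives the largest advantage, while the short but perfectly precise first response no longer wins on precision alone.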
Evaluation Highlights
  • Achieves 68.1% average factual precision across six benchmarks, a 23.1-point improvement over the Llama-3.1-8B-Instruct base model
  • Increases response detail level by 23% (more factual claims generated) compared to the base model, avoiding the brevity penalty common in prior work
  • Maintains a >50% win rate against the base model on helpfulness evaluations, unlike offline DPO baselines, which degraded to a ~37% win rate
Breakthrough Assessment
8/10
Significant because it successfully applies online RL (GRPO), usually reserved for math and code, to open-ended factuality by solving the reward latency and reward hacking problems. It demonstrates strong empirical gains over both base models and offline RL.