
Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Tianle Cai, Wenhao Huang
ByteDance Seed, Carnegie Mellon University, Fudan University
arXiv (2025)
Factuality RL Reasoning QA

📝 Paper Summary

Hallucination suppression, Uncertainty quantification
Trains LLMs to be honest communicators by replacing binary correctness rewards with strictly proper scoring rules, incentivizing the model to abstain or flag uncertainty when confidence is low.
Core Problem
Standard RL training with binary rewards incentivizes models to guess whenever the probability of being right is non-zero, creating 'good test-takers' rather than honest agents.
Why it matters:
  • Safety requires models to know when they are wrong, not just answer correctly
  • Current reward systems actively penalize abstention, forcing models to present guesses as facts
  • Hallucinations persist even in large reasoning models because evaluation metrics fail to penalize confident errors
Concrete Example: In a math problem, if a model is only 51% confident in its answer, standard RL still pushes it to state that answer as fact, because a guess has a chance of earning the +1 reward and abstaining never does. A calibrated model would instead say 'I don't know' or flag the uncertain step when the user's risk tolerance is low.
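The incentive in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the reward values, the `wrong_penalty` helper, and the choice of a 0 reward for abstaining are assumptions made for the example, not the paper's actual reward design.

```python
def expected_reward(p, r_correct, r_wrong):
    """Expected reward of committing to an answer that is right with probability p."""
    return p * r_correct + (1.0 - p) * r_wrong


def should_answer(p, r_correct, r_wrong, r_abstain=0.0):
    """Answer only if doing so beats abstaining in expectation."""
    return expected_reward(p, r_correct, r_wrong) > r_abstain


def wrong_penalty(t):
    """Hypothetical penalty for a wrong answer, chosen so that answering
    breaks even exactly when confidence equals the risk threshold t."""
    return -t / (1.0 - t)


p = 0.51  # the model's true chance of being right

# Binary reward (+1 correct, 0 wrong): guessing always beats abstaining.
binary_answers = should_answer(p, 1.0, 0.0)

# Risk-aware reward: a strict user (t = 0.9) versus a lenient user (t = 0.3).
strict_answers = should_answer(p, 1.0, wrong_penalty(0.9))
lenient_answers = should_answer(p, 1.0, wrong_penalty(0.3))

print(binary_answers, strict_answers, lenient_answers)
```

Under these assumptions the binary reward favors answering at any nonzero confidence, while the threshold-dependent penalty makes abstaining the better choice for the strict user and answering the better choice for the lenient one.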
Key Novelty
Behavioral Calibration via Proper Scoring Rules
  • Optimizes a single policy to handle any user-specified risk tolerance by integrating rewards over a distribution of risk thresholds
  • Replaces binary correctness rewards with a strictly proper scoring rule (such as the Brier score), whose expected reward is maximized only when the model's stated confidence matches its true probability of being correct
  • Extends calibration to individual claims within a response, allowing the model to highlight specific uncertain steps while maintaining the overall answer structure
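A small numerical check can make the first two points concrete. The sketch below assumes a negative Brier score as the reward and, separately, a hypothetical per-threshold reward (+(1 - t) for a correct answer, -t for a wrong one, 0 for abstaining) averaged over risk thresholds t drawn uniformly from [0, 1]. These forms and all function names are illustrative assumptions and may differ from the paper's exact formulation.

```python
def brier_reward(q, correct):
    """Negative Brier score for stated confidence q: higher is better, max is 0."""
    y = 1.0 if correct else 0.0
    return -(q - y) ** 2


def expected_brier(q, p):
    """Expected Brier reward when the answer is truly correct with probability p."""
    return p * brier_reward(q, True) + (1.0 - p) * brier_reward(q, False)


def threshold_reward(q, correct, t):
    """Hypothetical reward at risk threshold t: the model answers only if q > t."""
    if q <= t:
        return 0.0  # abstain
    return (1.0 - t) if correct else -t


def integrated_reward(q, correct, steps=1000):
    """Average threshold_reward over t ~ Uniform(0, 1) with a midpoint rule."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        total += threshold_reward(q, correct, t)
    return total / steps


def expected_integrated(q, p):
    """Expected integrated reward when the answer is correct with probability p."""
    return p * integrated_reward(q, True) + (1.0 - p) * integrated_reward(q, False)


p_true = 0.7
grid = [i / 100 for i in range(101)]  # candidate stated confidences

best_brier = max(grid, key=lambda q: expected_brier(q, p_true))
best_integrated = max(grid, key=lambda q: expected_integrated(q, p_true))

print(best_brier, best_integrated)
```

In this toy setting both objectives are maximized when the stated confidence equals the true accuracy of 0.7. With these particular assumptions the integrated reward works out to q - q²/2 for a correct answer and -q²/2 for a wrong one, which is the Brier reward up to scaling and a term that does not depend on q, so a single confidence value serves every threshold at once.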
Evaluation Highlights
  • Achieves a 0.806 log-scale gain in Signal-to-Noise Ratio (SNR) on BeyondAIME, well above GPT-5's gain of 0.207
  • Matches the calibration error of frontier models (Grok-4, Gemini-2.5-Pro) on SimpleQA zero-shot, despite using a much smaller 4B-parameter model
  • Claim-level uncertainty highlighting achieves a 0.183 log-scale SNR gain, surpassing Gemini-2.5-Pro (0.019)
Breakthrough Assessment
8/10
Strong theoretical grounding in proper scoring rules applied to RL, with impressive empirical results showing small models beating frontier models in calibration tasks.