
Toward Honest Language Models for Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
Carnegie Mellon University, Amazon, Emory University, Cornell University
arXiv (2025)
Tags: Reasoning · RL · Factuality · Benchmark

📝 Paper Summary

Topics: Deductive Reasoning · Model Alignment / Honesty · Reinforcement Learning (RL)
Anchor stabilizes reinforcement learning for deductive reasoning by injecting ground-truth trajectories into rollout groups, enabling models to learn to honestly abstain on unanswerable queries without training collapse.
Core Problem
LLMs fail to reason honestly, often fabricating answers when premises are insufficient. Standard RL methods like GRPO collapse when all rollouts in a group fail (zero reward), which is common in hard reasoning tasks.
Why it matters:
  • Reliable deployment requires models to recognize knowledge boundaries and abstain ('I don't know') rather than hallucinate, especially in safety-critical applications
  • Existing benchmarks focus on factual uncertainty (recall) rather than deductive answerability (reasoning from premises), leaving this specific type of honesty underexplored
  • Standard GRPO suffers from vanishing gradients and reinforces overconfidence when the model cannot find *any* correct reasoning path during exploration
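The collapse described above is easy to see in the group-relative advantage computation itself. The sketch below is a minimal illustration (not the paper's implementation): when every rollout in a group receives the same reward, such as all zeros on a hard query, the group mean equals each reward and every advantage, and hence the policy gradient, vanishes.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each rollout's reward
    minus the group mean, normalized by the group std deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically (e.g. all failed on a hard
        # deductive query): advantages are zero, the gradient vanishes.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Hard query, no correct path found during exploration:
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # -> [0.0, 0.0, 0.0, 0.0]

# One success in the group restores a usable learning signal:
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

This is exactly the regime Anchor targets: without at least one positive reward in the group, no update occurs, and the model's overconfident fabrications on unanswerable queries go uncorrected.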
Concrete Example: In a linear algebra word problem, if a necessary equation is removed (making the system unsolvable), a standard model will still attempt to calculate a price (e.g., '17 dollars') using irrelevant numbers, rather than correctly outputting 'Unknown'.
Key Novelty
Anchor (Augmented with Necessary Correct and HOnest Reasoning)
  • Modifies the Group Relative Policy Optimization (GRPO) process by deterministically injecting the ground-truth reasoning path into the group of generated rollouts
  • Ensures that every training batch contains at least one positive signal (the anchor), preventing the relative advantage of incorrect rollouts from collapsing to zero
  • Mathematically unifies Supervised Fine-Tuning (SFT) and RL: the anchor acts as a supervised signal while the remaining rollouts provide variance reduction and exploration via RL
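The injection step can be sketched in a few lines. This is an illustrative toy (function names, the string-matching reward, and the example trajectories are assumptions, not the paper's code): the ground-truth trajectory is deterministically appended to the sampled group, guaranteeing one reward-1 member, so the failed rollouts receive negative advantages while the anchor receives a positive, SFT-like signal.

```python
import statistics

def anchor_rollout_group(sampled_rollouts, ground_truth_trajectory):
    """Anchor's core move (sketch): deterministically add the
    ground-truth reasoning path to the sampled rollout group."""
    return sampled_rollouts + [ground_truth_trajectory]

def group_advantages(group, reward_fn):
    rewards = [reward_fn(r) for r in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # anchor makes std > 0 here
    return [(r - mean) / std for r in rewards]

# Toy reward: 1 if the rollout ends with the gold answer, else 0.
reward_fn = lambda r: 1.0 if r.endswith("Unknown") else 0.0

# All sampled rollouts fabricate an answer on an unanswerable query:
sampled = ["... so the price is 17 dollars"] * 3
group = anchor_rollout_group(sampled, "premises are insufficient -> Unknown")
print(group_advantages(group, reward_fn))  # negatives, then one positive
```

The last (anchor) advantage is positive and each fabricated rollout's is negative, which is the claimed unification: the anchor term behaves like a supervised target while the sampled rollouts contribute RL-style exploration and variance reduction.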
Evaluation Highlights
  • Performance on answerable queries drops to nearly zero for Qwen-2.5-3B-Instruct once reasoning depth exceeds 6 steps (based on RQ1 motivation analysis)
  • Qwen-3-0.6B performs below random guessing (0.5 accuracy) on binary answerability tasks due to formatting errors
  • Anchor is proposed to stabilize learning in regimes where standard GRPO and SFT fail; specific improvement numbers are not reported in this summary
Breakthrough Assessment
7/10
Addresses a critical failure mode in reasoning RL (training collapse on hard negatives) with a theoretically grounded, simple fix. The formal definition of deductive honesty via graph answerability is also valuable.