
Toward Honest Language Models for Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
Carnegie Mellon University, Amazon, Emory University, Cornell University
arXiv (2025)
Tags: Reasoning · RL · Factuality · Benchmark

📝 Paper Summary

Topics: Deductive Reasoning · Model Alignment / Honesty · Reinforcement Learning (RL)
Anchor stabilizes reinforcement learning for deductive reasoning by injecting ground-truth trajectories into rollout groups, enabling models to learn to honestly abstain on unanswerable queries without training collapse.
Core Problem
LLMs fail to reason honestly, often fabricating answers when premises are insufficient. Standard RL methods like GRPO collapse when all rollouts in a group fail (zero reward), which is common in hard reasoning tasks.
Why it matters:
  • Reliable deployment requires models to recognize knowledge boundaries and abstain ('I don't know') rather than hallucinate, especially in safety-critical applications
  • Existing benchmarks focus on factual uncertainty (recall) rather than deductive answerability (reasoning from premises), leaving this specific type of honesty underexplored
  • Standard GRPO suffers from vanishing gradients and reinforces overconfidence when the model cannot find *any* correct reasoning path during exploration
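The collapse described above is easy to see in the group-relative advantage computation itself. The sketch below is a minimal illustration (not the paper's implementation): when every rollout in a group receives the same reward, such as all zeros on a hard query, the group mean equals each reward and every advantage, and hence the policy gradient, vanishes.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each rollout's reward
    minus the group mean, normalized by the group std deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically (e.g. all failed on a hard
        # deductive query): advantages are zero, the gradient vanishes.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Hard query, no correct path found during exploration:
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # -> [0.0, 0.0, 0.0, 0.0]

# One success in the group restores a usable learning signal:
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

This is exactly the regime Anchor targets: without at least one positive reward in the group, no update occurs, and the model's overconfident fabrications on unanswerable queries go uncorrected.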
Concrete Example: In a linear algebra word problem, if a necessary equation is removed (making the system unsolvable), a standard model will still attempt to calculate a price (e.g., '17 dollars') using irrelevant numbers, rather than correctly outputting 'Unknown'.
Key Novelty
Anchor (Augmented with Necessary Correct and HOnest Reasoning)
  • Modifies the Group Relative Policy Optimization (GRPO) process by deterministically injecting the ground-truth reasoning path into the group of generated rollouts
  • Ensures that every training batch contains at least one positive signal (the anchor), preventing the relative advantage of incorrect rollouts from collapsing to zero
  • Mathematically unifies Supervised Fine-Tuning (SFT) and RL: the anchor acts as a supervised signal while the remaining rollouts provide variance reduction and exploration via RL
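The injection step can be sketched in a few lines. This is an illustrative toy (function names, the string-matching reward, and the example trajectories are assumptions, not the paper's code): the ground-truth trajectory is deterministically appended to the sampled group, guaranteeing one reward-1 member, so the failed rollouts receive negative advantages while the anchor receives a positive, SFT-like signal.

```python
import statistics

def anchor_rollout_group(sampled_rollouts, ground_truth_trajectory):
    """Anchor's core move (sketch): deterministically add the
    ground-truth reasoning path to the sampled rollout group."""
    return sampled_rollouts + [ground_truth_trajectory]

def group_advantages(group, reward_fn):
    rewards = [reward_fn(r) for r in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # anchor makes std > 0 here
    return [(r - mean) / std for r in rewards]

# Toy reward: 1 if the rollout ends with the gold answer, else 0.
reward_fn = lambda r: 1.0 if r.endswith("Unknown") else 0.0

# All sampled rollouts fabricate an answer on an unanswerable query:
sampled = ["... so the price is 17 dollars"] * 3
group = anchor_rollout_group(sampled, "premises are insufficient -> Unknown")
print(group_advantages(group, reward_fn))  # negatives, then one positive
```

The last (anchor) advantage is positive and each fabricated rollout's is negative, which is the claimed unification: the anchor term behaves like a supervised target while the sampled rollouts contribute RL-style exploration and variance reduction.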
Evaluation Highlights
  • Performance on answerable queries drops to nearly zero for Qwen-2.5-3B-Instruct once reasoning depth exceeds 6 steps (based on RQ1 motivation analysis)
  • Qwen-3-0.6B performs below random guessing (0.5 accuracy) on binary answerability tasks due to formatting errors
  • Anchor is proposed to stabilize learning in regimes where standard GRPO and SFT fail; specific improvement numbers are not reported in this summary
Breakthrough Assessment
7/10
Addresses a critical failure mode in reasoning RL (training collapse on hard negatives) with a theoretically grounded, simple fix. The formal definition of deductive honesty via graph answerability is also valuable.