Boosting Deductive Reasoning with Step Signals In RLHF

📝 Paper Summary

Deductive Reasoning, RLHF, Data Synthesis

MuseD is a scalable data synthesis method that generates formal logic problems with verifiable reasoning steps, enabling dense reward signals for RLHF that significantly improve model reasoning capabilities.

Core Problem

Training LLMs for multi-step deductive reasoning is difficult because existing data lacks verifiable step-by-step supervision, and generating contradiction-free formal logic prompts with scalable complexity is challenging.

Why it matters:

Current reasoning datasets often rely on outcome supervision, which cannot correct flawed reasoning (hallucinations or logical leaps) that happens to reach the correct answer by chance
Manual creation of rigorous multi-step logic problems is expensive and unscalable, limiting the amount of high-quality training data available for alignment

Concrete Example: Given premises 'Cats are mammals' and 'Mammals are animals', a model might conclude 'Cats are animals' using common sense shortcuts rather than logic. MuseD uses virtual entities (e.g., 'Alpha is Beta') to force actual deductive reasoning and verifies the specific elimination of middle terms.

Key Novelty

Multi-step Deduction (MuseD) Data Synthesis & Step-Level Scoring

Backward generation of logic trees: Starts from a conclusion and recursively adds premises using valid syllogisms to guarantee contradiction-free, solvable prompts with controllable complexity
Step-level verification: Scores reasoning chains by tracking the elimination of 'middle terms' (the terms shared by two premises that drop out of their conclusion), ensuring the model actually performed the deduction steps rather than guessing

Architecture

The MuseD data generation pipeline: Backward generation of the logic tree followed by forward entity filling.
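
A minimal sketch of the two stages named above, under stated assumptions. The rule table uses the four valid first-figure moods (Barbara, Celarent, Darii, Ferio); this summary does not say which moods MuseD actually uses. The names `generate_tree`, `fill_entities`, the `T0, T1, ...` placeholders, and the Greek-letter entity list are hypothetical and chosen for illustration only.

```python
import random

# Conclusion form -> (major premise form, minor premise form).
# First-figure moods only: Barbara, Celarent, Darii, Ferio (an assumed subset).
RULES = {"A": ("A", "A"), "E": ("E", "A"), "I": ("A", "I"), "O": ("E", "I")}
TEMPLATES = {
    "A": "All {s} are {p}",
    "E": "No {s} are {p}",
    "I": "Some {s} are {p}",
    "O": "Some {s} are not {p}",
}
# Hypothetical pool of virtual entity names.
NAMES = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta", "Theta"]


def generate_tree(conclusion_form, steps, rng):
    """Backward pass: split one leaf per step into two premises via a fresh middle term."""
    conclusion = (conclusion_form, "T0", "T1")  # (form, subject, predicate)
    leaves, middles = [conclusion], []
    for i in range(steps):
        form, s, p = leaves.pop(rng.randrange(len(leaves)))
        m = f"T{i + 2}"  # fresh placeholder, never used before
        major_form, minor_form = RULES[form]
        leaves += [(major_form, m, p), (minor_form, s, m)]
        middles.append(m)
    return leaves, conclusion, middles


def fill_entities(props, rng):
    """Forward pass: map placeholders to shuffled virtual names and render text."""
    terms = sorted({t for _, s, p in props for t in (s, p)})
    mapping = dict(zip(terms, rng.sample(NAMES, len(terms))))
    return [TEMPLATES[f].format(s=mapping[s], p=mapping[p]) for f, s, p in props]


rng = random.Random(0)
premises, conclusion, middles = generate_tree("E", steps=3, rng=rng)
sentences = fill_entities(premises + [conclusion], rng)
print("\n".join(sentences[:-1]))
print("Hypothesis:", sentences[-1])
```

In this sketch each expansion adds exactly one premise and one middle term, so `steps` directly controls the depth of reasoning a solver needs, and every middle term appears in exactly two premises.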

Evaluation Highlights

RLHF with MuseD data improves Llama-3-8B-Instruct accuracy by 15.5 points (48.0 to 63.5) on the out-of-domain FOLIO benchmark compared to the base model
Achieves a 30.5-point improvement (59.3 to 89.8) on the in-domain MuseD test set compared to the base model
Step-level rewards (Process + Outcome) outperform Outcome-only rewards by 9.9 points (69.1 to 79.0) on difficult reasoning tasks (10-step depth)

Breakthrough Assessment

7/10

Strong methodology for synthetic data generation in formal logic. The step-level scoring mechanism is verifiably correct by design, addressing a major bottleneck in process supervision for reasoning.

⚙️ Technical Details

Problem Definition

Setting: Multi-step deductive reasoning based on categorical syllogisms (Aristotelian logic forms A, E, I, O)

Inputs: A set of premises (e.g., 'All A are B', 'No B are C') and a hypothesis to prove or judge

Outputs: A reasoning chain deriving the conclusion and a final verdict (True/False/Unknown)
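
One way to represent these inputs and outputs in code is sketched below. The class names `Proposition` and `Problem`, the sentence templates, and the `VERDICTS` tuple are illustrative assumptions, not structures taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Tuple

# The four categorical forms; the English templates are illustrative wording.
TEMPLATES = {
    "A": "All {s} are {p}",       # universal affirmative
    "E": "No {s} are {p}",        # universal negative
    "I": "Some {s} are {p}",      # particular affirmative
    "O": "Some {s} are not {p}",  # particular negative
}
VERDICTS = ("True", "False", "Unknown")


@dataclass(frozen=True)
class Proposition:
    form: str       # one of "A", "E", "I", "O"
    subject: str
    predicate: str

    def __post_init__(self):
        if self.form not in TEMPLATES:
            raise ValueError(f"unknown categorical form: {self.form!r}")

    def render(self) -> str:
        return TEMPLATES[self.form].format(s=self.subject, p=self.predicate)


@dataclass(frozen=True)
class Problem:
    premises: Tuple[Proposition, ...]
    hypothesis: Proposition


problem = Problem(
    premises=(Proposition("A", "Alpha", "Beta"), Proposition("E", "Beta", "Gamma")),
    hypothesis=Proposition("E", "Alpha", "Gamma"),
)
for premise in problem.premises:
    print(premise.render())
print("Hypothesis:", problem.hypothesis.render())
```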

Pipeline Flow

Prompt Generation (Backward Logic Tree Construction)
Response Generation (LLM Sampling)
Response Scoring (Step & Outcome Verification)
Preference Pair Construction
RLHF Training (Reward Modeling + PPO)

System Modules

Prompt Generator

Construct logical premise sets from a target conclusion

Model or implementation: Rule-based Syllogism Engine

Response Scorer

Evaluate the correctness of the reasoning chain

Model or implementation: Algorithm (Rule-based)

Reward Model (Training)

Predict scalar rewards for PPO training

Model or implementation: Initialized from Llama-3-8B-Instruct

Policy Model (Training)

Generate reasoning steps

Model or implementation: Llama-3-8B-Instruct
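
The rule-based Response Scorer can be illustrated with a small sketch. It assumes the model's response has already been parsed into (major premise, minor premise, derived statement) triples, which this summary does not describe, and it only recognizes the four first-figure moods. The function name `step_score` and the scoring rule (valid eliminations divided by total middle terms) are assumptions based on the Step Score description in this summary.

```python
# (major form, minor form) -> conclusion form, for the first-figure moods
# Barbara, Celarent, Darii, Ferio (an assumed subset of valid syllogisms).
DERIVE = {("A", "A"): "A", ("E", "A"): "E", ("A", "I"): "I", ("E", "I"): "O"}


def step_score(premises, middle_terms, steps):
    """Fraction of middle terms eliminated by valid steps.

    premises: iterable of (form, subject, predicate) triples.
    steps: list of (major, minor, derived) triples parsed from the response.
    """
    known = set(premises)
    eliminated = set()
    for major, minor, derived in steps:
        if major not in known or minor not in known:
            continue  # step relies on a statement that was never established
        (f_major, m1, p), (f_minor, s, m2) = major, minor
        form = DERIVE.get((f_major, f_minor))
        if m1 != m2 or form is None or derived != (form, s, p):
            continue  # not a valid inference under the rule table
        known.add(derived)
        eliminated.add(m1)
    targets = set(middle_terms)
    return len(eliminated & targets) / len(targets) if targets else 0.0


premises = [("A", "Alpha", "Beta"), ("A", "Beta", "Gamma"), ("A", "Gamma", "Delta")]
middles = ["Beta", "Gamma"]
full_chain = [
    (("A", "Beta", "Gamma"), ("A", "Alpha", "Beta"), ("A", "Alpha", "Gamma")),
    (("A", "Gamma", "Delta"), ("A", "Alpha", "Gamma"), ("A", "Alpha", "Delta")),
]
# A "logical leap": uses 'All Alpha are Gamma' without ever deriving it.
leap = [(("A", "Gamma", "Delta"), ("A", "Alpha", "Gamma"), ("A", "Alpha", "Delta"))]

print(step_score(premises, middles, full_chain))      # 1.0
print(step_score(premises, middles, full_chain[:1]))  # 0.5
print(step_score(premises, middles, leap))            # 0.0
```

The leap case reaches the right final statement yet scores 0.0, which is the behavior outcome-only supervision cannot provide.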

Modeling

Base Model: Llama-3-8B-Instruct

Training Method: RLHF (Reward Modeling + PPO)

Objective Functions:

Purpose: Train reward model to distinguish better reasoning chains.

Formally: Binary cross-entropy loss on preference pairs where 'chosen' has higher step/result score than 'rejected'

Purpose: Optimize policy to maximize expected reward.

Formally: PPO clipping objective
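
For reference, these two objectives are commonly written as below. These are the standard textbook forms, not notation taken from the paper. Here r_φ is the reward model, y_c and y_r are the chosen and rejected responses to prompt x, σ is the logistic function, ρ_t is the probability ratio between the current and previous policy, Â_t is an advantage estimate, and ε is the clipping range (its value is not stated in this summary).

```latex
% Pairwise reward-model loss: binary cross-entropy with the chosen response as the positive class
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}
    \left[ \log \sigma\!\left( r_\phi(x, y_c) - r_\phi(x, y_r) \right) \right]

% PPO clipped surrogate objective (maximized with respect to \theta)
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t \left[ \min\!\left(
      \rho_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t
    \right) \right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```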

Training Data:

Synthesized 100k prompts using MuseD method
Sampled 5 responses per prompt using Llama-3-8B-Instruct
Constructed preference pairs based on Step Score and Result Score
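
A minimal sketch of how scored samples might be turned into preference pairs. The ordering rule used here (compare Result Score first, then Step Score) and the field names are assumptions for illustration; this summary does not specify how the paper combines the two scores.

```python
from itertools import combinations


def build_pairs(responses):
    """Pair up sampled responses to one prompt; the higher-scored one is 'chosen'."""
    pairs = []
    for a, b in combinations(responses, 2):
        # Assumed ordering: result score first, step score as tie-breaker.
        key_a = (a["result_score"], a["step_score"])
        key_b = (b["result_score"], b["step_score"])
        if key_a == key_b:
            continue  # no preference signal between equally scored responses
        chosen, rejected = (a, b) if key_a > key_b else (b, a)
        pairs.append({"chosen": chosen["text"], "rejected": rejected["text"]})
    return pairs


samples = [
    {"text": "r1", "result_score": 1, "step_score": 1.0},
    {"text": "r2", "result_score": 1, "step_score": 0.5},
    {"text": "r3", "result_score": 0, "step_score": 0.5},
    {"text": "r4", "result_score": 0, "step_score": 0.5},
]
pairs = build_pairs(samples)
print(len(pairs))  # 5 (the r3/r4 tie is skipped)
```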

Key Hyperparameters:

learning_rate: 1e-6 (RM), 5e-7 (PPO)
batch_size: 16 (RM), 64 (PPO)
epochs: 1
max_length: 2048
kl_coefficient: 0.01

Compute: 8 × NVIDIA A800 80GB GPUs

Comparison to Prior Work

vs. Outcome Supervision (ORM): MuseD uses dense 'Step Scores' derived from the logic tree structure, providing supervision on the *process* of elimination, not just the final answer
vs. Standard Synthetic Data (e.g. standard syllogisms): MuseD explicitly models the 'backward' generation to control complexity (tree depth) and uses virtual entities to prevent common-sense shortcuts
vs. Process Reward Models (PRM): Typical PRMs require human annotation or expensive model-based labelling; MuseD automates process labelling via the underlying logic tree structure

Limitations

Evaluation is primarily focused on formal deductive reasoning; applicability to inductive or abductive reasoning is not explored
The synthetic nature of the data (virtual entities like 'Alpha', 'Beta') creates a distribution shift from natural language reasoning tasks
Performance gains on out-of-domain natural language datasets (e.g., +15.5 points on FOLIO) are substantial but smaller than on in-domain synthetic data (+30.5 points on MuseD-Test)

Reproducibility

Code: https://github.com/zhangyipin/mused/tree/main

Code for data generation and evaluation is available at https://github.com/zhangyipin/mused/tree/main. The paper specifies the base model (Llama-3-8B-Instruct) and hyperparameters for training. The exact training dataset is synthetic and can be reproduced using the provided method.

📊 Experiments & Results

Evaluation Setup

Evaluate on both in-domain synthetic data (MuseD-Test) and out-of-domain standard reasoning benchmarks.

Benchmarks:

MuseD-Test (Multi-step deductive reasoning (synthetic)) [New]
FOLIO (First-order logic reasoning (natural language))
ProofWriter (Rule-based reasoning (natural language))
ProntoQA (Synthetic deductive reasoning)
LogicalDeduction (Logical reasoning (BBH subset))

Metrics:

Accuracy (Result Score)
Step Score (Process correctness)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison showing RLHF with MuseD data significantly boosts performance over the base model on both in-domain and out-of-domain tasks.

Benchmark	Metric	Baseline	This Paper	Δ
MuseD-Test	Accuracy	59.3	89.8	+30.5
FOLIO	Accuracy	48.0	63.5	+15.5
ProofWriter	Accuracy	58.4	67.0	+8.6
ProntoQA	Accuracy	46.0	62.4	+16.4
LogicalDeduction	Accuracy	47.2	61.6	+14.4

Ablation study on reward composition (Method PN vs R) showing the importance of step-level signals.

Benchmark	Metric	Baseline	This Paper	Δ
MuseD-Test (Average)	Accuracy	86.1	89.8	+3.7
MuseD-Test (10-step)	Accuracy	69.1	79.0	+9.9
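
The Δ column is an absolute difference in accuracy points, not a relative change. The short check below recomputes each Δ from the Baseline and This Paper columns and also prints the relative gain for comparison; the numbers are copied from the table, and the relative figures are derived here rather than reported in the paper.

```python
# (benchmark, baseline accuracy, accuracy after MuseD RLHF, reported delta)
rows = [
    ("MuseD-Test", 59.3, 89.8, 30.5),
    ("FOLIO", 48.0, 63.5, 15.5),
    ("ProofWriter", 58.4, 67.0, 8.6),
    ("ProntoQA", 46.0, 62.4, 16.4),
    ("LogicalDeduction", 47.2, 61.6, 14.4),
    ("MuseD-Test (Average)", 86.1, 89.8, 3.7),
    ("MuseD-Test (10-step)", 69.1, 79.0, 9.9),
]

for name, baseline, ours, reported in rows:
    delta = round(ours - baseline, 1)               # absolute gain, in points
    relative = 100 * (ours - baseline) / baseline   # relative gain, in percent
    status = "ok" if delta == reported else "MISMATCH"
    print(f"{name:<22} {delta:+.1f} points ({relative:+.1f}% relative) {status}")
```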

Experiment Figures

Comparison of RLHF strategies (Method PN vs Method R) across different reasoning depths (number of steps).

Main Takeaways

Step-level supervision is critical for complex reasoning: While outcome-only rewards work for simple tasks, dense step rewards (MuseD) provide a gain of about 10 accuracy points (69.1 to 79.0) on deep reasoning chains (10 steps).
Synthetic logic training generalizes: Training on abstract, virtual-entity logic problems (Alpha, Beta) transfers effectively to natural language reasoning tasks (FOLIO, ProofWriter), suggesting the model learns the underlying deductive mechanism.
Positive rewards drive learning: Experiments comparing Positive-Only (P) vs Positive-Negative (PN) pairs show that identifying correct steps is more important than penalizing wrong ones, though using both (PN) yields the best results.

📚 Prerequisite Knowledge

Prerequisites

Formal logic (Categorical propositions and syllogisms)
Reinforcement Learning from Human Feedback (RLHF)
PPO (Proximal Policy Optimization)

Key Terms

Syllogism: A logical argument applying deductive reasoning to arrive at a conclusion based on two premises (Major and Minor)

Middle Term: The term that appears in both premises but not in the conclusion, which must be 'eliminated' during the reasoning process to link the subject and predicate

RLHF: Reinforcement Learning from Human Feedback—a method to align language models using reward models trained on preference data

MuseD: Multi-step Deduction—the authors' proposed method for synthesizing logical data and scoring reasoning steps

Categorical Proposition: A proposition that asserts or denies that all or some of the members of one category (the subject term) are included in another (the predicate term)

PPO: Proximal Policy Optimization—a reinforcement learning algorithm used here to fine-tune the model against the reward model

Step Score: A metric defined in this paper that calculates the ratio of correctly eliminated middle terms in the generated reasoning chain

FOLIO: A human-written benchmark dataset for natural language reasoning, with examples annotated in first-order logic

ProofWriter: A synthetic dataset for logical reasoning over natural language rules