Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

📝 Paper Summary

Process Supervision · Reasoning-as-Planning · Reinforcement Learning from Feedback

The framework synthesizes process rewards by running offline simulations from intermediate reasoning states to estimate their success rates, then uses these rewards to train a policy via Direct Preference Optimization.

Core Problem

Large Language Models often hallucinate during complex reasoning, and existing solutions like online planning (MCTS) are too slow, while human process supervision is too expensive.

Why it matters:

Online planning (Reasoning-as-Planning) introduces high latency due to frequent state assessments and large search spaces during inference
Process supervision (step-by-step feedback) is effective but relies on costly human annotation, making it difficult to scale
Outcome-only supervision fails to correct flawed reasoning traces that luckily arrive at the correct answer (false positives)

Concrete Example: In logical reasoning, an LLM might reach a correct conclusion using invalid logic (hallucination). Outcome supervision would reward this, reinforcing the bad logic. Search-based methods like MCTS could catch this but require hundreds of rollouts at inference time, making the system impractically slow.

Key Novelty

Offline Simulation for Process Reward Synthesis + DPO

Instead of expensive human labels, the system estimates the 'value' of an intermediate reasoning step by simulating multiple completions (Monte Carlo rollouts) and checking how many lead to the correct answer
These estimated values train a Process Reward Model (PRM), which scores full trajectories
The policy model is then optimized using Direct Preference Optimization (DPO) on pairs of trajectories ranked by these synthesized process rewards, avoiding unstable PPO training

Architecture

Illustration of the reasoning process as a Markov Decision Process (MDP) using the ReAct format

Evaluation Highlights

Surpasses strong counterparts like GPT-3.5-Turbo on challenging logical reasoning benchmarks using a 7B parameter model
Demonstrates significant improvements over robust baseline models on logical and mathematical reasoning tasks
Reduces reliance on human annotations by synthesizing process rewards automatically via outcome-guided simulation

Breakthrough Assessment

7/10

Clever combination of offline simulation (to replace human process supervision) and DPO. Effectively moves the compute cost of 'planning' from inference time to training time.

⚙️ Technical Details

Problem Definition

Setting: Natural Language Reasoning as a Markov Decision Process (MDP)

Inputs: Context/Question x

Outputs: Reasoning trajectory τ consisting of state-action pairs <s, a> leading to answer y

Pipeline Flow

Seed Trajectory Collection (LLM generates initial solutions)
Offline Simulation (Sample intermediate states, rollout K completions)
Reward Estimation (Calculate success rate of rollouts)
PRM Training (Train classifier to predict success rate)
DPO Training (Optimize Policy LLM using PRM-scored trajectories)
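
The offline simulation and reward estimation steps above can be illustrated with a minimal sketch. It assumes the reward of an intermediate state is simply the fraction of K sampled completions that reach the ground-truth answer. The names `estimate_step_reward` and `make_toy_rollout` are hypothetical, and the toy rollout stands in for sampling from the policy LLM; none of this is taken from the released code.

```python
import itertools
from typing import Callable, List, Sequence


def estimate_step_reward(
    prefix: List[str],
    rollout: Callable[[List[str]], str],
    gold_answer: str,
    k: int = 4,
) -> float:
    """Monte Carlo estimate: share of k completions from `prefix` that hit `gold_answer`."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(rollout(prefix) == gold_answer for _ in range(k))
    return hits / k


def make_toy_rollout(outcomes: Sequence[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for the policy LLM: cycles through fixed final answers."""
    answers = itertools.cycle(outcomes)

    def rollout(prefix: List[str]) -> str:
        # Toy assumption: a prefix containing a flawed step never recovers.
        if any("flawed" in step for step in prefix):
            return "wrong"
        return next(answers)

    return rollout


rollout = make_toy_rollout(["A", "A", "B", "A"])
sound = estimate_step_reward(["premise 1", "valid deduction"], rollout, "A", k=4)
flawed = estimate_step_reward(["premise 1", "flawed deduction"], rollout, "A", k=4)
print(sound, flawed)
```

In this toy setup the sound prefix receives a higher estimated reward than the flawed one, which is the signal the PRM is later trained to predict.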

System Modules

Policy Model (LLM)

Generates the reasoning steps (actions) and states; initialized as the base LLM

Model or implementation: 7B model (likely LLaMA-2 based on context)

Process Reward Model (PRM)

Predicts the expected correctness/reward of an intermediate reasoning step

Model or implementation: Classifier trained on simulation data

Novel Architectural Elements

Integration of offline Monte Carlo simulation specifically to synthesize labels for a Process Reward Model
Utilization of trajectory-level rewards (accumulated from PRM) to construct preference pairs for DPO training
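
As a rough illustration of the second point, the sketch below accumulates per-step PRM scores into a trajectory-level reward and forms preference pairs from it. It assumes a plain sum over step scores and a gap threshold named `sigma`, mirroring the confidence margin listed under Key Hyperparameters; the paper's exact accumulation rule (its Eq. 7 involves the threshold C) is not reproduced here, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def trajectory_reward(step_scores: Sequence[float]) -> float:
    """Assumed accumulation: sum of PRM scores over the steps of one trajectory."""
    return sum(step_scores)


def build_preference_pairs(
    rewards: Sequence[float], sigma: float
) -> List[Tuple[int, int]]:
    """Return (chosen_index, rejected_index) pairs whose reward gap exceeds sigma."""
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        gap = rewards[i] - rewards[j]
        if gap > sigma:
            pairs.append((i, j))
        elif -gap > sigma:
            pairs.append((j, i))
    return pairs


# Three candidate trajectories for the same question, each with per-step PRM scores.
step_scores = [[0.9, 0.8], [0.5, 0.4], [0.9, 0.7]]
rewards = [trajectory_reward(s) for s in step_scores]
pairs = build_preference_pairs(rewards, sigma=0.5)
print(pairs)
```

Pairs whose rewards are too close are skipped, so only confidently ranked trajectories would feed the DPO stage under this assumption.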

Modeling

Base Model: 7B model (implied LLaMA-2-7B-Chat or similar from context)

Training Method: Direct Preference Optimization (DPO) with synthesized process rewards

Objective Functions:

Purpose: Train PRM to predict success probability of intermediate steps.

Formally: Cross-Entropy loss minimizing difference between predicted reward and empirical success rate r_j.

Purpose: Optimize policy to prefer high-reward trajectories.

Formally: DPO loss L_DPO(pi_theta; pi_ref) + Margin-based pairwise loss L_pair for correct trajectories.
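
For reference, the DPO term can be written in its standard form from the original DPO formulation. The notation below is an assumption made for this summary: x is the question, tau_w and tau_l are the preferred and dispreferred trajectories in a pair, and the logistic function is written as sigmoid to avoid confusion with the margin sigma. The additional pairwise term L_pair is not reproduced because its exact form is not given in this summary.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, \tau_w,\, \tau_l)}
    \left[
      \log \operatorname{sigmoid}\!\left(
        \beta \log \frac{\pi_\theta(\tau_w \mid x)}{\pi_{\mathrm{ref}}(\tau_w \mid x)}
        - \beta \log \frac{\pi_\theta(\tau_l \mid x)}{\pi_{\mathrm{ref}}(\tau_l \mid x)}
      \right)
    \right]
```

Here beta controls how strongly the policy is penalized for drifting from the reference model.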

Training Data:

Seed trajectories collected from LLM
K completions sampled for intermediate states via offline simulation
Outcomes verified against ground truth labels (y)

Key Hyperparameters:

K: Number of rollout trajectories for simulation (value not reported in snippet)
C: Minimum successful simulations threshold (Eq. 7)
beta: DPO KL divergence penalty parameter (Eq. 9)
sigma: Confidence margin for pairwise loss (Eq. 10)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAP: RAP incurs high inference latency due to online search; this method compiles the search benefit into the policy via offline simulation and training
vs. Process Supervision: Eliminates the need for expensive human annotation by synthesizing rewards from outcome verification
vs. MATH-Shepherd: This work optimizes the policy with DPO and covers logical reasoning tasks, whereas MATH-Shepherd focuses on PPO and verification (this comparison is not detailed in the paper)

Limitations

Relies on the availability of ground truth final answers to verify simulations (cannot be applied to open-ended creative generation)
Simulation quality depends on the base capability of the LLM; if the model cannot reach the answer, rewards cannot be estimated
Computational cost is shifted to the training phase (offline simulation), where it can still be significant

Reproducibility

Code: https://github.com/SparkJiao/dpo-trajectory-reasoning

Code and trajectory data are released in the repository above. The provided paper excerpt does not give specific hyperparameter values (learning rates, batch sizes); these may appear in the full appendix.

📊 Experiments & Results

Evaluation Setup

Offline simulation for reward synthesis followed by DPO training, evaluated on reasoning tasks

Benchmarks:

Logical Reasoning Benchmarks (Logical deduction/Reasoning)
Mathematical Reasoning Benchmarks (Math word problems; likely GSM8K based on context)

Metrics:

Accuracy (Outcome Correctness)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Comparison between Search-based Inference (RAP) and the proposed Offline Simulation Training framework

Main Takeaways

The 7B parameter model trained with this framework outperforms GPT-3.5-Turbo on logical reasoning benchmarks, validating the efficacy of synthesized process rewards
Offline simulation effectively identifies reliable reasoning paths without requiring human process labels
Using PRM-scored trajectories with DPO provides a more stable training signal than traditional RL methods like PPO

📚 Prerequisite Knowledge

Prerequisites

Markov Decision Processes (MDP)
Reinforcement Learning (RL) concepts: Rewards, Policy, Value estimation
Large Language Models (LLMs) and Chain-of-Thought (CoT)

Key Terms

DPO: Direct Preference Optimization—an algorithm that optimizes a policy directly from preference pairs without explicitly training a reward model in the loop (unlike PPO)

PRM: Process Reward Model—a model trained to assign scores to intermediate steps of reasoning, rather than just the final outcome

MCTS: Monte Carlo Tree Search—a heuristic search algorithm that expands the most promising moves by simulating outcomes

Reasoning-as-Planning (RAP): Modeling the generation of reasoning steps as a planning problem (like chess) where future states are explored before committing

Offline Simulation: Running rollouts (simulations) during the training/data collection phase to estimate values, rather than during live inference

ReAct: Reasoning and Acting—a paradigm where LLMs generate reasoning traces and task-specific actions in an interleaved manner

PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm often used for RLHF, known to be sometimes unstable or resource-intensive