Offline Reinforcement Learning for LLM Multi-Step Reasoning

📝 Paper Summary

Offline Reinforcement Learning, LLM Reasoning, Value-Guided Search

OREO improves LLM reasoning by jointly training a policy and a value function to satisfy the soft Bellman equation, enabling fine-grained credit assignment from sparse rewards without paired preference data.

Core Problem

Current offline alignment methods like DPO require paired preference data (scarce in reasoning) and struggle with credit assignment because they treat entire trajectories uniformly, while rejection sampling ignores valuable failure data.

Why it matters:

Training LLMs with online RL (like PPO) is prohibitively expensive for most users due to data generation costs
Reasoning tasks typically provide only sparse terminal rewards (correct/incorrect), making it difficult to identify which specific step caused an error
Discarding incorrect trajectories (as in Rejection Sampling) wastes significant information about failure modes that could improve robustness

Concrete Example: In a 10-step math solution, a model might make a small calculation error at step 2 that only surfaces as a wrong final answer at step 10. DPO penalizes the entire sequence uniformly relative to a correct one. OREO's value function instead registers the drop in expected reward at step 2, pinpointing where the reasoning went wrong.
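The value-drop idea in this example can be illustrated with a toy sketch. The per-step values below are made-up numbers rather than outputs of the paper's model, and `largest_value_drop` is a hypothetical helper introduced only for illustration.

```python
from typing import List, Tuple


def largest_value_drop(values: List[float]) -> Tuple[int, float]:
    """Find the reasoning step after which the value estimate falls the most.

    values[0] is the estimate for the prompt alone; values[i] is the estimate
    after step i has been generated. Returns (step_index, drop), 1-indexed.
    """
    # drops[i - 1] is how much the estimate decreased because of step i.
    drops = [values[i - 1] - values[i] for i in range(1, len(values))]
    step = max(range(len(drops)), key=lambda i: drops[i]) + 1
    return step, drops[step - 1]


# Hypothetical value estimates for a prompt followed by 10 reasoning steps.
# The calculation error at step 2 shows up as a sharp drop (0.78 -> 0.35).
toy_values = [0.80, 0.78, 0.35, 0.33, 0.30, 0.28, 0.25, 0.22, 0.20, 0.15, 0.05]
bad_step, drop = largest_value_drop(toy_values)
```

With only a terminal reward, every step in this chain would receive the same sequence-level signal; the per-step value differences are what localize the error.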

Key Novelty

Offline REasoning Optimization (OREO)

Adapts Path Consistency Learning to LLMs by minimizing the difference between the current value and the target value (reward + next state value) at every token step
Explicitly learns a value function alongside the policy from offline data, allowing the model to estimate expected future rewards for intermediate steps even when only terminal rewards are known
Uses the learned value function 'for free' at test time to guide beam search or select the best-of-K actions, filtering out incorrect reasoning paths early
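For reference, the consistency condition behind these points can be written out. This is the standard soft Bellman (path consistency) relation for KL-regularized RL with deterministic transitions; the notation (β, π_ref, R_t) follows this summary, and the paper's exact formulation may differ in details.

```latex
% Optimal policy \pi^* and value V^* under a KL penalty (weight \beta) to \pi_{\mathrm{ref}}:
\begin{align}
V^*(s_t) - V^*(s_{t+1})
  &= r(s_t, a_t) - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \\
% Summing from step t to T-1, with V^*(s_T) = 0:
V^*(s_t)
  &= R_t - \beta \sum_{i=t}^{T-1} \log \frac{\pi^*(a_i \mid s_i)}{\pi_{\mathrm{ref}}(a_i \mid s_i)},
  \qquad R_t = \sum_{i=t}^{T-1} r(s_i, a_i)
\end{align}
```

With sparse rewards, R_t is simply the terminal reward for every t, so the second line ties each intermediate value to the final outcome and the policy's log-ratios along the path.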

Architecture

Comparison of implicit (policy-based) vs explicit (OREO) value functions on math problems.

Evaluation Highlights

Outperforms Rejection Sampling by +10.4% success rate on ALFWorld (Unseen split) using MiniCPM-2B, demonstrating superior generalization
Achieves 52.5% accuracy on MATH with Qwen2.5-Math-1.5B, surpassing DPO (49.2%) and Rejection Sampling (50.3%)
Test-time beam search guided by the learned value function improves MATH accuracy by 17.9% relative to greedy decoding

Breakthrough Assessment

8/10

Offers a principled theoretical correction to DPO's limitations in reasoning. The method consistently beats strong baselines (Rejection Sampling, DPO) and provides a practically useful value model for inference scaling.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) where states are token sequences, actions are generated tokens, and rewards are sparse (non-zero only at terminal step T)

Inputs: Prompt/Question x

Outputs: Reasoning chain and final answer y

Pipeline Flow

Input Prompt
Policy Model (generates next token/step)
Value Model (evaluates current state)
Inference Search (optional: Beam Search / Best-of-K)
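The flow above can be sketched with a Best-of-K selection step. Everything here is a toy stand-in: `toy_propose` plays the role of the policy, `toy_value` the role of the value model, and the action strings and scores are invented for illustration rather than taken from the paper.

```python
import random
from typing import Callable, List

State = List[str]

ACTIONS = ["look around", "go to desk", "open drawer", "take key"]
TOY_SCORES = {"look around": 0.1, "go to desk": 0.4, "open drawer": 0.6, "take key": 0.9}


def toy_propose(state: State, rng: random.Random) -> str:
    """Stand-in policy: sample one candidate next action."""
    return rng.choice(ACTIONS)


def toy_value(state: State) -> float:
    """Stand-in value model: score a state by its most recent action."""
    return TOY_SCORES[state[-1]]


def best_of_k_step(state: State,
                   propose: Callable[[State, random.Random], str],
                   value: Callable[[State], float],
                   k: int,
                   rng: random.Random) -> str:
    """Sample k candidate actions and keep the one whose next state scores highest."""
    candidates = [propose(state, rng) for _ in range(k)]
    return max(candidates, key=lambda action: value(state + [action]))


def run_episode(prompt: str, k: int, max_steps: int, seed: int = 0) -> State:
    """Input prompt -> policy proposals -> value scoring -> selected action, repeated."""
    rng = random.Random(seed)
    state = [prompt]
    for _ in range(max_steps):
        state.append(best_of_k_step(state, toy_propose, toy_value, k, rng))
    return state


trajectory = run_episode("find the key", k=4, max_steps=3)
```

Setting `k=1` recovers plain sampling from the policy, which is why the search stage is optional in the pipeline.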

System Modules

Policy Model

Generate reasoning tokens autoregressively

Model or implementation: Qwen2.5-Math-1.5B / DeepSeekMath-7B-Instruct

Value Network

Estimate the expected return (correctness probability) of the current partial generation

Model or implementation: Same architecture as Policy (often initialized from SFT checkpoint with scalar head)

Search Strategy

Select optimal trajectory using Value Network scores

Model or implementation: Step-level Beam Search (Math) or Best-of-K (Agents)
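A minimal sketch of step-level beam search guided by a value function follows. The `expand`, `value`, and `is_terminal` callables are hypothetical interfaces chosen for this illustration, and the toy scoring rule is invented; the paper's actual search implementation may differ.

```python
from typing import Callable, List

State = List[str]


def step_level_beam_search(prompt: str,
                           expand: Callable[[State], List[str]],
                           value: Callable[[State], float],
                           is_terminal: Callable[[State], bool],
                           beam_width: int,
                           max_steps: int) -> State:
    """Keep the beam_width highest-valued partial chains after each reasoning step."""
    beams: List[State] = [[prompt]]
    for _ in range(max_steps):
        candidates: List[State] = []
        for state in beams:
            if is_terminal(state):
                candidates.append(state)  # finished chains are carried forward as-is
            else:
                candidates.extend(state + [step] for step in expand(state))
        # Rank every candidate by the value estimate and prune to the beam width.
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_width]
        if all(is_terminal(state) for state in beams):
            break
    return max(beams, key=value)


# Toy problem: each step is either "good" or "bad"; a chain ends after 3 steps.
def toy_expand(state: State) -> List[str]:
    return ["good", "bad"]


def toy_value(state: State) -> float:
    return float(state.count("good"))


def toy_is_terminal(state: State) -> bool:
    return len(state) == 4  # prompt + 3 steps


best = step_level_beam_search("q", toy_expand, toy_value, toy_is_terminal,
                              beam_width=2, max_steps=5)
```

Pruning happens per reasoning step rather than per token, so the value model is queried once for each candidate step instead of once for each generated token.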

Novel Architectural Elements

Explicit Value Network parameterization trained jointly with the policy on offline data (unlike DPO/KTO which use implicit values)
Integration of Value Network into step-level beam search for test-time scaling

Modeling

Base Model: Qwen2.5-Math-1.5B, DeepSeekMath-7B-Instruct, MiniCPM-2B

Training Method: OREO (Offline REasoning Optimization)

Objective Functions:

Purpose: Ensure the learned value matches the target value defined by rewards and policy entropy.

Formally: Mean squared soft Bellman error: L_V(φ) = mean over t of (V_φ(s_t) − (R_t + β · log_sum_exp terms))²

Purpose: Update the policy to be consistent with the value estimates and the reference policy.

Formally: L_π(θ), derived from minimizing the soft Bellman inconsistency, with stop-gradients on the target terms

Purpose: Stabilize training.

Formally: KL regularization term L_reg = KL(π_θ || π_ref)
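To make the soft Bellman error concrete, here is a framework-free numeric sketch of one path-consistency residual for a single trajectory with a sparse terminal reward. It assumes the KL-regularized form in which the residual at step t is V(s_t) − R_t + β Σ_{i≥t} log(π_θ/π_ref); the function names are hypothetical, and stop-gradients and the KL regularizer are omitted because no gradients are computed here. It is not the authors' implementation.

```python
from typing import List


def soft_bellman_residuals(values: List[float],
                           log_ratios: List[float],
                           terminal_reward: float,
                           beta: float) -> List[float]:
    """Per-step residuals V(s_t) - R_t + beta * sum_{i >= t} log_ratios[i].

    values[t]     : value estimate V(s_t) before action a_t is taken
    log_ratios[t] : log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t)
    With a sparse terminal reward and no discounting, R_t equals
    terminal_reward at every step.
    """
    if len(values) != len(log_ratios):
        raise ValueError("values and log_ratios must have the same length")
    residuals: List[float] = []
    suffix = 0.0
    # Walk backwards so the suffix sum of log-ratios is built incrementally.
    for t in reversed(range(len(values))):
        suffix += log_ratios[t]
        residuals.append(values[t] - terminal_reward + beta * suffix)
    residuals.reverse()
    return residuals


def mean_squared(xs: List[float]) -> float:
    """Average of squared entries, i.e. the MSE of the residuals against zero."""
    return sum(x * x for x in xs) / len(xs)


# Hypothetical numbers: a 3-step trajectory that ends with a correct answer.
example_loss = mean_squared(
    soft_bellman_residuals(values=[0.6, 0.7, 0.9],
                           log_ratios=[0.10, -0.20, 0.05],
                           terminal_reward=1.0,
                           beta=0.5)
)
```

The residuals vanish exactly when the values and the policy's log-ratios are mutually consistent along the trajectory, which is the condition the value and policy objectives push toward.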

Adaptation: Full fine-tuning

Key Hyperparameters:

beta: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Experiments run on 1.5B, 2B, and 7B parameter models

Comparison to Prior Work

vs. DPO: OREO uses unpaired data and enables step-level credit assignment via an explicit value function, whereas DPO requires pairs and assigns credit to the whole sequence
vs. Rejection Sampling: OREO learns from both success and failure trajectories (via value updates), whereas Rejection Sampling discards failures
vs. PCL (Path Consistency Learning) [not cited in paper]: OREO applies similar soft consistency principles but specifically adapts them for large language model reasoning and token-level optimization
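The data-usage differences in this comparison can be sketched on a toy set of trajectories. The `Trajectory` type and the pairing rule (one correct and one incorrect sample for the same prompt) are assumptions made for illustration, not the exact data pipelines used by the paper or its baselines.

```python
from typing import Dict, List, NamedTuple, Tuple


class Trajectory(NamedTuple):
    prompt: str
    steps: List[str]
    reward: float  # 1.0 if the final answer is correct, else 0.0


def rejection_sampling_data(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep only successful trajectories; failures are discarded."""
    return [t for t in trajs if t.reward == 1.0]


def preference_pairs(trajs: List[Trajectory]) -> List[Tuple[Trajectory, Trajectory]]:
    """Build (chosen, rejected) pairs; prompts lacking both outcomes yield nothing."""
    by_prompt: Dict[str, List[Trajectory]] = {}
    for t in trajs:
        by_prompt.setdefault(t.prompt, []).append(t)
    pairs = []
    for group in by_prompt.values():
        good = [t for t in group if t.reward == 1.0]
        bad = [t for t in group if t.reward == 0.0]
        pairs.extend((g, b) for g in good for b in bad)
    return pairs


def unpaired_offline_data(trajs: List[Trajectory]) -> List[Trajectory]:
    """Use every trajectory with its reward label, successes and failures alike."""
    return list(trajs)


data = [
    Trajectory("A", ["step"], 1.0),
    Trajectory("A", ["step"], 0.0),
    Trajectory("B", ["step"], 0.0),
    Trajectory("B", ["step"], 0.0),
    Trajectory("C", ["step"], 1.0),
]
counts = (len(rejection_sampling_data(data)),
          len(preference_pairs(data)),
          len(unpaired_offline_data(data)))
```

In this toy set, only prompt A can form a preference pair, while the unpaired setting still uses all five trajectories, including the two failures on prompt B.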

Limitations

Experiments limited to relatively small models (up to 7B parameters)
Computationally more expensive than DPO due to training a separate value network
Requires ground truth reasoning chains or environments to verify correctness (cannot learn from purely unlabeled data without a verifier/environment)

Reproducibility

Code: https://github.com/jwhj/OREO

The paper uses standard datasets (GSM8K, MATH, ALFWorld). Hyperparameters such as beta and learning rate are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Offline training on datasets with ground truth solutions (Math) or expert trajectories (Agents), followed by evaluation on test sets.

Benchmarks:

GSM8K (Grade school math reasoning)
MATH (Competition-level math reasoning)
ALFWorld (Embodied agent control, TextWorld-based)

Metrics:

Accuracy (Math)
Success Rate (ALFWorld)
Statistical methodology: Not explicitly reported in the paper

Key Results

Math reasoning results showing OREO outperforms baselines across varying model sizes.
Benchmark	Metric	Baseline	This Paper	Δ
MATH	Accuracy (%)	50.3	52.5	+2.2
GSM8K	Accuracy (%)	74.9	77.3	+2.4
MATH	Accuracy (%)	47.2	49.2	+2.0

Embodied agent results demonstrating strong generalization to unseen environments.
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld (Unseen)	Success Rate (%)	68.7	79.1	+10.4
ALFWorld (Seen)	Success Rate (%)	79.3	80.7	+1.4
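As a quick sanity check on the Δ column, the reported differences can be recomputed from the baseline and OREO scores. The row values are copied from the table; `recompute_deltas` is a hypothetical helper.

```python
from typing import List, Tuple

# (benchmark, baseline score, OREO score, reported delta), all in percentage points.
ROWS: List[Tuple[str, float, float, float]] = [
    ("MATH", 50.3, 52.5, 2.2),
    ("GSM8K", 74.9, 77.3, 2.4),
    ("MATH", 47.2, 49.2, 2.0),
    ("ALFWorld (Unseen)", 68.7, 79.1, 10.4),
    ("ALFWorld (Seen)", 79.3, 80.7, 1.4),
]


def recompute_deltas(rows: List[Tuple[str, float, float, float]]) -> List[float]:
    """Absolute improvement (OREO minus baseline), rounded to one decimal place."""
    return [round(ours - base, 1) for _, base, ours, _ in rows]


deltas = recompute_deltas(ROWS)
all_match = deltas == [reported for *_, reported in ROWS]
```

These are absolute differences in percentage points, not relative improvements.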

Experiment Figures

Accuracy on GSM8K and MATH over 3 training iterations for OREO vs Rejection Sampling.

Accuracy on GSM8K and MATH (subset) using Value-Guided Beam Search with varying compute budgets.

Main Takeaways

OREO consistently outperforms DPO and Rejection Sampling across all benchmarks, especially in harder tasks (MATH, ALFWorld Unseen).
Iterative training with OREO leads to continuous improvement without the saturation observed in Rejection Sampling.
The explicit value function learned by OREO provides a better signal for correctness than the implicit value derived from policy probabilities (as in DPO), enabling effective test-time search.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Value Functions)
Maximum Entropy RL / Soft Q-Learning
Language Model Alignment (DPO, RLHF)

Key Terms

Soft Bellman Equation: A consistency condition in maximum entropy RL relating the optimal value function to the immediate reward and the entropy-regularized value of the next state

DPO: Direct Preference Optimization—an offline method aligning models to preferences by optimizing the policy directly without a separate reward model, typically requiring paired data

Rejection Sampling: A simple baseline where the model generates multiple samples, filters for correct ones, and fine-tunes on those correct trajectories (closely related to STaR)

Sparse Reward: A setting where feedback (reward) is only received at the end of a task (e.g., correct answer), not at every intermediate step

Credit Assignment: The problem of determining which past actions contributed to a final outcome; difficult in reasoning when a long chain yields a single final score

PCL: Path Consistency Learning—an algorithm that unifies value and policy learning by enforcing consistency between values along a trajectory

Value Function: A model that predicts the expected future cumulative reward from a given state (partial reasoning chain)

Beam Search: A search algorithm that, at each step, expands the current candidates and keeps only a fixed number (the beam width) of the most promising ones

SFT: Supervised Fine-Tuning—training on labeled target outputs