What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) · Chain of Thought (CoT) Reasoning · Large Language Model (LLM) Post-training

VC-PPO stabilizes Proximal Policy Optimization (PPO) for long chain-of-thought tasks by pre-training the value model and decoupling advantage estimation parameters to fix reward signal decay.

Core Problem

Standard PPO fails on long chain-of-thought (CoT) tasks because the value model is biased, which leads to inaccurate advantage estimates and a collapse in output length.

Why it matters:

Current RLHF methods like PPO struggle with the long reasoning chains required for complex math problems, often degrading into short, incorrect answers
Alternative methods like GRPO lack PPO's fine-grained token-level feedback, potentially limiting exploration efficiency in complex tasks
The standard practice of initializing the value model from the reward model creates immediate bias, while standard GAE parameters cause reward signals to vanish over long sequences

Concrete Example: When training on math problems, standard PPO rapidly shortens the model's response length (e.g., from thousands of tokens to very few) early in training. This happens because the value model underestimates the value of early tokens in long sequences: with a GAE lambda below 1.0, the reward given at the end of the response decays as it propagates backward, so the policy sees early reasoning steps as having low advantage.
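To make the decay concrete, here is a rough worked calculation. It assumes a per-token discount of γ = 1 (not stated in this summary) and uses λ = 0.95, the policy GAE lambda listed under Key Hyperparameters. Under GAE, a reward that arrives k tokens later enters a token's advantage with weight (γλ)^k.

```latex
\[
w(k) = (\gamma\lambda)^{k}, \qquad \gamma = 1,\ \lambda = 0.95
\]
\[
w(100) = 0.95^{100} \approx 5.9 \times 10^{-3}, \qquad
w(1000) = 0.95^{1000} \approx 5 \times 10^{-23}
\]
```

Under these assumptions, a correctness reward at <EOS> is effectively invisible to tokens a thousand positions earlier unless the value model already carries that information.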

Key Novelty

Value-Calibrated PPO (VC-PPO)

Pre-trains the value model on SFT data to eliminate the 'cold-start' bias caused by initializing it from a reward model that only scores end-of-sentence tokens
Decouples Generalized Advantage Estimation (GAE) parameters: uses lambda=1.0 for the value target to prevent signal decay over long chains, while keeping a lower lambda (0.95) for the policy update to maintain stability

Architecture

Comparison of PPO vs VC-PPO performance and length during training

Evaluation Highlights

Achieves 49.0% accuracy on the AIME benchmark, significantly outperforming standard PPO which collapses to 5.6%
Surpasses the DeepSeek-R1-Zero reproduction result of 39.0% on AIME reported in previous literature
Maintains stable output lengths (reasoning chains) throughout training, avoiding the length collapse observed in baseline PPO

Breakthrough Assessment

8/10

Identifies and fixes a critical, specific failure mode of PPO in the high-impact area of Long-CoT reasoning. The solution is theoretically grounded and yields large empirical gains (49.0% vs. 5.6% Pass@1 on AIME against standard PPO).

⚙️ Technical Details

Problem Definition

Setting: Token-level Markov Decision Process (MDP) for language generation

Inputs: A prompt x (e.g., a math problem)

Outputs: A response y consisting of a sequence of tokens, ending with a terminal action

Pipeline Flow

Prompt Input -> Policy Model (Actor) -> Generate Long-CoT Response
Response -> Reward Model (Verifier) -> Scalar Reward at <EOS>
Response -> Value Model (Critic) -> Token-level Value Estimates
Advantage Computation (Decoupled GAE) -> Policy Update (PPO) & Value Update
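The advantage step of this pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function name `gae`, the discount gamma=1.0, the 200-token toy response, and the all-zero critic are assumptions made for the example. Only the two lambda values (0.95 for the policy, 1.0 for the value target) come from the summary.

```python
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float, lam: float) -> List[float]:
    """Generalized Advantage Estimation over one finished response.

    rewards[t] is the reward received after token t, values[t] is the
    critic's estimate at token t. The value after the last token is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # GAE recursion
        advantages[t] = running
    return advantages


# Toy response (assumed): 200 tokens, reward 1.0 only at the final <EOS>
# token, and an uninformative critic that predicts 0 everywhere.
T = 200
rewards = [0.0] * (T - 1) + [1.0]
values = [0.0] * T

# Decoupled GAE: one lambda for the policy advantage, another for the
# value target (gamma=1.0 is an assumption, not taken from the summary).
policy_adv = gae(rewards, values, gamma=1.0, lam=0.95)
value_adv = gae(rewards, values, gamma=1.0, lam=1.0)
value_targets = [a + v for a, v in zip(value_adv, values)]

print(f"policy advantage at first token: {policy_adv[0]:.2e}")
print(f"value target at first token:     {value_targets[0]:.2f}")
```

In this toy setting the lambda=1.0 value target equals the terminal reward at every position, while the lambda=0.95 policy advantage shrinks geometrically with distance from <EOS>.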

System Modules

Policy Model (Actor)

Generates the reasoning chain and final answer

Model or implementation: Based on Qwen-2.5-32B-Instruct

Reward Model (Verifier)

Evaluates the correctness of the final answer

Model or implementation: Rule-based verifier

Value Model (Critic)

Estimates expected future rewards for each token to compute advantages

Model or implementation: Initialized from Reward Model, then Pre-trained

Novel Architectural Elements

Decoupled GAE Architecture: The advantage calculation uses different lambda parameters for the Policy update (standard lambda) vs. the Value target calculation (lambda=1.0)

Modeling

Base Model: Qwen-2.5-32B-Instruct

Training Method: Value-Calibrated PPO (VC-PPO)

Objective Functions:

Purpose: Maximize expected reward with KL constraint.
Formally: Standard PPO clipped surrogate objective.

Purpose: Minimize error in value prediction.
Formally: MSE between predicted value and computed return (using lambda=1.0).

Purpose: Calibrate value model before RL.
Formally: Minimize MSE on offline SFT trajectories using Monte-Carlo returns (lambda=1.0).
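For reference, these objectives can be written in their common textbook forms. The notation is assumed here and may differ from the paper's: θ for policy parameters, φ for value parameters, ε for the clip range, and R_t for the return target. The KL term is omitted for brevity.

```latex
% Policy: PPO clipped surrogate, with probability ratio r_t(\theta)
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\
\operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]
\]
% Value: regression onto the return; with \lambda = 1 the GAE return
% reduces to the Monte-Carlo return
\[
L^{V}(\phi) = \mathbb{E}_t\!\left[\big(V_\phi(s_t) - R_t\big)^2\right], \qquad
R_t = \sum_{l=0}^{T-t-1} \gamma^{l}\, r_{t+l}
\]
```

In this form, the policy advantage \hat{A}_t would be computed with the policy lambda, while R_t uses lambda=1.0. Value pre-training would apply the same L^V to offline SFT trajectories before any policy update.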

Key Hyperparameters:

kl_coefficient: 0.02
clip_range: 0.2
policy_gae_lambda: 0.95
value_gae_lambda: 1.0
learning_rate_actor: 1e-6
learning_rate_critic: 5e-6
per_device_batch_size: 1 (with gradient accumulation)
num_epochs: 1

Compute: Not reported in the paper

Comparison to Prior Work

vs. PPO: VC-PPO pre-trains the value model and uses lambda=1.0 for value targets to prevent signal decay in long sequences
vs. GRPO: VC-PPO retains a learned value model to enable token-level feedback, whereas GRPO relies on sparse response-level comparisons
vs. OpenAI o1/DeepSeek-R1 (context only; not used as baselines in the paper): VC-PPO aims to stabilize PPO so that it reaches similar Long-CoT results without discarding the value network
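To illustrate the contrast with GRPO drawn above, here is a minimal sketch of a group-relative advantage. The function name, the use of the population standard deviation, the `eps` stabilizer, and the four example rewards are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """One scalar advantage per sampled response, normalized in-group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to the same prompt: 1.0 = correct, 0.0 = wrong.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Every token in a response shares that response's single advantage, which is the response-level signal that a learned value model is meant to refine into token-level feedback.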

Limitations

Experiments are limited to a single benchmark (AIME) and a single domain (Math)
Value pre-training adds an extra training stage, and its computational cost, compared with standard PPO or GRPO
Does not explicitly compare against GRPO in the main results table, only against standard PPO
No statistical significance tests reported

Reproducibility

The paper does not provide a link to code or data. The method is specified mathematically (algorithms and equations). The base model (Qwen-2.5-32B) is public, but the SFT data and the rule-based verifier are described only in general terms.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning using Chain-of-Thought

Benchmarks:

AIME (High-school competition math problems)

Metrics:

Pass@1 Accuracy
Average Output Length (tokens)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
AIME	Pass@1	5.6	49.0	+43.4
AIME	Pass@1	18.1	49.0	+30.9
AIME	Pass@1	42.0	49.0	+7.0
AIME	Pass@1	42.5	49.0	+6.5

Experiment Figures

Correlation between token position and advantage values in standard PPO

The effect of GAE lambda parameter on reward signal propagation

Main Takeaways

Standard PPO suffers from 'value initialization bias' (from initializing with the reward model) and 'reward signal decay' (from GAE lambda < 1.0), causing it to fail on long tasks.
Pre-training the value model on SFT data prevents the initial collapse in output length by providing accurate initial value estimates.
Decoupling GAE (using lambda=1.0 for value targets) allows reward signals to propagate from the end of the chain to the beginning, which is crucial for long CoT sequences.
Both proposed components (Value Pre-training and Decoupled GAE) are necessary for optimal performance; removing either leads to a significant drop in accuracy.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (Policy Gradients, Value Functions)
Proximal Policy Optimization (PPO)
Generalized Advantage Estimation (GAE)
Chain of Thought (CoT) prompting

Key Terms

PPO: Proximal Policy Optimization—an RL algorithm that updates policies in small, constrained steps to ensure stability

CoT: Chain of Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer

GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' of an action (how much better it is than average) by balancing bias and variance

SFT: Supervised Fine-Tuning—training a model on labeled examples (prompt-response pairs) before applying RL

GRPO: Group Relative Policy Optimization—a PPO variant that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a learned value model

Value Model: A neural network that predicts the expected future reward from a specific state (sequence of tokens)

AIME: American Invitational Mathematics Examination—a challenging math competition benchmark used to evaluate reasoning capabilities

RLHF: Reinforcement Learning from Human Feedback—fine-tuning LLMs using rewards derived from human or rule-based preferences

Olympiad-level math: Extremely difficult mathematics problems requiring complex, multi-step reasoning, typical of competitions like AIME or IMO