RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

📝 Paper Summary

Reinforcement Learning for LLMs
Mathematical Reasoning

RL-PLUS counters capability boundary collapse in RLVR-trained LLMs by combining internal exploitation with learning from external data: multiple importance sampling stabilizes the off-policy updates, and an exploration-based advantage function prioritizes correct but low-probability reasoning paths.

Core Problem

Standard RLVR (Reinforcement Learning with Verifiable Reward) improves average performance by refining known paths (inward exploitation) but fails to discover new solutions, causing the model's total potential capability (Pass@k) to shrink compared to the base model.

Why it matters:

Current methods like GRPO improve Pass@1 but degrade Pass@128 (capability boundary collapse), effectively narrowing the model's problem-solving scope
Sparse rewards in long reasoning chains make 'outward exploration' (finding completely new valid paths) extremely difficult for on-policy methods
Reliance on inward exploitation limits the model to merely polishing existing knowledge rather than acquiring new reasoning abilities

Concrete Example: In Figure 1(a), while an RLVR-trained model's Pass@1 exceeds the base model, its Pass@128 is substantially lower, indicating it has lost the breadth of potential solutions the base model originally possessed.
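The crossover described in this example can be reproduced with a toy calculation. The sketch below assumes each problem has a fixed per-attempt success probability and that attempts are independent; the two probability profiles are hypothetical and are not taken from the paper.

```python
def pass_at_k_from_prob(p: float, k: int) -> float:
    """Chance of at least one success in k independent attempts,
    each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(per_problem_p, k: int) -> float:
    """Average Pass@k over a set of problems."""
    return sum(pass_at_k_from_prob(p, k) for p in per_problem_p) / len(per_problem_p)


# Hypothetical per-problem success probabilities (illustrative only).
base_model = [0.05, 0.05, 0.05, 0.05]  # weak on every problem, but never zero
rlvr_model = [0.90, 0.90, 0.00, 0.00]  # strong on two problems, lost the other two

for k in (1, 128):
    base = mean_pass_at_k(base_model, k)
    rlvr = mean_pass_at_k(rlvr_model, k)
    print(f"k={k:3d}  base={base:.3f}  rlvr={rlvr:.3f}")
```

Under these assumed numbers the RLVR-style profile wins at k=1 (0.45 versus 0.05) but loses at k=128 (0.5 versus roughly 0.999), because no number of retries recovers a problem whose success probability has dropped to zero.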

Key Novelty

Hybrid-policy Optimization with Exploration-Based Advantage

Uses Multiple Importance Sampling (MIS) to stabilize learning from external data by treating samples as coming from a mixture of the old policy and external sources, preventing variance explosion
Reshapes the RL reward using an 'Exploration-Based Advantage Function' that amplifies signals for correct answers the model currently assigns low probability to, explicitly incentivizing the learning of 'hard' knowledge
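Why a mixture in the denominator helps can be seen in a few lines. This is a minimal sketch, not the paper's estimator: it assumes an equal-weight mixture of the old policy and the external source (the weights `w_old` and `w_ext` are illustrative), and it assumes the external source's token probability `p_ext` is known, which the summary does not confirm.

```python
def single_proposal_ratio(p_new: float, p_proposal: float) -> float:
    """Ordinary importance ratio against a single proposal distribution."""
    return p_new / p_proposal


def mis_ratio(p_new: float, p_old: float, p_ext: float,
              w_old: float = 0.5, w_ext: float = 0.5) -> float:
    """Importance ratio against a mixture of two proposals.

    The mixture weights are an assumption made for this sketch.
    """
    if abs(w_old + w_ext - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return p_new / (w_old * p_old + w_ext * p_ext)


# Hypothetical token from an external demonstration: the old policy finds it
# very unlikely, while the external source finds it likely.
p_new, p_old, p_ext = 0.20, 1e-6, 0.80

naive = single_proposal_ratio(p_new, p_old)  # explodes because p_old is tiny
mixed = mis_ratio(p_new, p_old, p_ext)       # stays bounded by p_new / (w_ext * p_ext)

print(f"single-proposal ratio: {naive:.1f}")
print(f"mixture ratio:         {mixed:.3f}")
```

Because the mixture denominator is at least as large as each weighted component, the ratio cannot blow up as long as one of the proposals assigns the token reasonable probability.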

Architecture

Comparison of Pass@k curves between Base Model and RLVR-trained model

Evaluation Highlights

+5.2 points average improvement over SFT+GRPO across six math reasoning benchmarks
Up to 69.2% average relative improvement over GRPO across diverse model families
Analysis of Pass@k curves confirms RL-PLUS maintains high potential (Pass@k) at large k, effectively resolving the capability boundary collapse issue seen in baselines

Breakthrough Assessment

8/10

Identifies and addresses a critical, subtle failure mode of current RLVR (capability collapse) with a theoretically grounded hybrid approach. Strong reported gains over standard GRPO.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (MDP) for reasoning generation

Inputs: Initial prompt q

Outputs: Reasoning sequence y leading to a verifiable answer

Pipeline Flow

Input Processing: Prompt q
Generation: LLM generates reasoning chain y
Output: Final answer

System Modules

LLM Policy

Generate reasoning steps and final answer

Model or implementation: Evaluated on Qwen2.5-Math-7B, Llama-3.1-8B-Instruct, DeepSeek-Coder-V2-Lite-Instruct

Modeling

Base Model: Qwen2.5-Math-7B, Llama-3.1-8B-Instruct, DeepSeek-Coder-V2-Lite-Instruct

Training Method: RL-PLUS (Hybrid-policy Optimization)

Objective Functions:

Purpose: Optimize policy using both internal and external data.

Formally: J(θ) combines the standard policy-gradient term on the model's own rollouts with a novel term for external data.

Purpose: Stabilize external data learning.

Formally: The Multiple Importance Sampling (MIS) ratio r^m_{i,t}(θ) places a mixture of the old policy and the external policy in the denominator.

Purpose: Incentivize exploration of hard correct paths.

Formally: The Exploration-Based Advantage A^c_{i,t} scales the reward by C_{i,t} = (1 - P(correct_token))^γ.
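The weight C_{i,t} = (1 - P(correct_token))^γ is simple enough to compute directly. The sketch below assumes the weight multiplies a scalar advantage; the function names and the value γ = 2 are illustrative choices, not values reported in this summary.

```python
def exploration_weight(p_correct_token: float, gamma: float) -> float:
    """C = (1 - p)^gamma: near 1 for unlikely correct tokens, near 0 for likely ones."""
    if not 0.0 <= p_correct_token <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (1.0 - p_correct_token) ** gamma


def exploration_advantage(advantage: float, p_correct_token: float, gamma: float) -> float:
    """Scale an advantage by the exploration weight (assumed multiplicative form)."""
    return advantage * exploration_weight(p_correct_token, gamma)


gamma = 2.0  # hypothetical setting for illustration
for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  weight={exploration_weight(p, gamma):.3f}")
```

With γ = 2, a correct token the model assigns probability 0.1 keeps about 81% of its signal, while one at probability 0.9 keeps about 1%. Setting γ = 0 makes every weight 1 and recovers the unweighted advantage.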

Adaptation: Full model update (implied)

Trainable Parameters: Full model

Key Hyperparameters:

gamma (γ): Exponent of the exploration weight C_{i,t} in the Exploration-Based Advantage (Eq. 4)
beta: KL coefficient (usually small or omitted in recent RLVR)

Compute: Not reported in the paper

Comparison to Prior Work

vs. GRPO: RL-PLUS adds off-policy external data support via MIS and rewards low-probability correct tokens
vs. ReLIFT/LUFFY: Uses a theoretically grounded Multiple Importance Sampling estimator rather than heuristics or alternating stages
vs. PPO [not cited in paper]: RL-PLUS explicitly removes the clipping mechanism to allow large updates for new knowledge

Limitations

Relies on the availability of verifiable rewards (e.g., math/code), limiting applicability to open-ended tasks
Requires high-quality external data for the off-policy component to be effective
No statistical significance tests reported in the provided summary

Reproducibility

Code: https://github.com/YihongDong/RL-PLUS

Hyperparameters and exact dataset details are not fully specified in the summarized text; they are likely given in the full paper or its appendix.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks with binary (correct/incorrect) reward verification
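A binary verifiable reward can be as simple as an exact match on a normalized final answer. The sketch below is a hypothetical stand-in, not the verifier used in the paper; the normalization rules are assumptions chosen for brevity.

```python
def normalize_answer(ans: str) -> str:
    """Strip whitespace, a trailing period, and case (assumed normalization rules)."""
    return ans.strip().rstrip(".").replace(" ", "").lower()


def binary_reward(predicted: str, reference: str) -> float:
    """Return 1.0 when the normalized answers match exactly, else 0.0."""
    return 1.0 if normalize_answer(predicted) == normalize_answer(reference) else 0.0


print(binary_reward(" 42. ", "42"))   # formatting differences are forgiven
print(binary_reward("x = 3", "x=3"))  # spacing differences are forgiven
print(binary_reward("1/2", "0.5"))    # equivalent values are NOT recognized
```

The last call shows the limit of string matching: mathematically equivalent answers written differently receive zero reward, so a practical verifier would need some form of equivalence checking.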

Benchmarks:

GSM8K (Math Reasoning)
MATH (Math Reasoning)
4 other unnamed math benchmarks (Math Reasoning)

Metrics:

Pass@1
Pass@k (k=128)
Statistical methodology: Not explicitly reported in the paper

Key Results

Aggregate results reported in the Introduction, summarizing performance across multiple benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average of 6 math benchmarks	Score (points, vs. SFT+GRPO)	Not reported in the paper	Not reported in the paper	+5.2
Average across model families	Relative improvement over GRPO (%)	0 (GRPO reference)	Up to 69.2	Up to +69.2%

Main Takeaways

RL-PLUS solves the 'Capability Boundary Collapse' problem where standard RLVR improves Pass@1 but degrades Pass@k.
Effective integration of external data (off-policy) with internal exploration (on-policy) is key to surpassing base model limits.
The method generalizes well to Out-of-Distribution (OOD) tasks, suggesting true reasoning improvement rather than just dataset memorization.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient, Importance Sampling)
Large Language Models (Chain-of-Thought reasoning)
Probability concepts (KL divergence, Variance, Bias)

Key Terms

RLVR: Reinforcement Learning with Verifiable Reward—training LLMs using outcomes (like correct math answers) as reward signals

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average, removing the need for a separate value network
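The group-relative advantage can be sketched in a few lines. This version also divides by the group's standard deviation, as in the commonly used GRPO formulation; that normalization and the `eps` term are assumptions beyond the definition given here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Advantage of each sample relative to its group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored with a binary reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every response fails, all advantages are zero and the prompt gives no signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call illustrates the sparse-reward difficulty: when no sampled response is correct, the group carries no learning signal at all.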

Pass@k: A metric measuring the probability that at least one correct solution is generated given k independent attempts
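In practice Pass@k is usually estimated from n ≥ k samples per problem rather than by drawing exactly k. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples; whether the paper uses this exact estimator is not stated in this summary.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k_estimate(n=4, c=1, k=2))      # 1 - 3/6
print(pass_at_k_estimate(n=256, c=3, k=128))  # few correct samples, large k
```

The estimate is averaged over all problems in a benchmark to obtain the reported Pass@k.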

Capability Boundary Collapse: A phenomenon where an RL-tuned model improves its single-attempt accuracy (higher Pass@1) but solves fewer problems than the base model when given many attempts (lower Pass@k at large k), meaning it has lost part of the base model's breadth of solutions

SFT: Supervised Fine-Tuning—training a model on labeled examples

MIS: Multiple Importance Sampling—a technique to estimate properties of a target distribution using samples from multiple proposal distributions to reduce variance

OOD: Out-of-Distribution—tasks or data that differ significantly from the training data

On-policy: RL methods that learn only from data generated by the current policy

Off-policy: RL methods that learn from data generated by other policies (e.g., historical data or external demonstrations)