Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Shanghai AI Laboratory, Westlake University, Nanjing University, The Chinese University of Hong Kong
arXiv (2025)
RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Large Reasoning Models (LRMs) · Reinforcement Learning with Verifiable Rewards (RLVR)
LUFFY augments on-policy reinforcement learning with off-policy reasoning traces from stronger models, using mixed-policy updates and gradient shaping to push the learner past the boundary of what it can discover through its own exploration.
Core Problem
Standard on-policy RLVR (Reinforcement Learning with Verifiable Rewards) is constrained by the model's initial capabilities; if a model cannot spontaneously generate a correct reasoning chain, it cannot reinforce it, leading to failure in weak models.
Why it matters:
  • On-policy methods like those used in DeepSeek-R1 primarily amplify existing behaviors rather than teaching genuinely new reasoning skills
  • Weak foundation models (e.g., Llama-3.1-8B) often hit performance plateaus or fail completely (zero reward) on hard tasks because they lack the 'aha moments' needed to start the RL loop
Concrete Example: When training Llama-3.1-8B on a 'Hard' math subset, standard on-policy RL yields flat zero rewards because the model never generates a correct solution to learn from. In contrast, LUFFY uses off-policy traces to provide initial learning signals, successfully training the model.
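The failure mode above can be made concrete with a minimal sketch (illustrative only, not the paper's implementation): with a binary verifiable reward and GRPO-style group-centered advantages, a rollout group in which every sample is wrong yields zero advantage everywhere, so the policy gradient vanishes. Injecting a single correct off-policy trace into the group restores a nonzero signal.

```python
# Illustrative sketch: binary verifiable reward + group-centered advantages.
# All names here are hypothetical, not from the LUFFY codebase.

def verifiable_reward(answer: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 iff the final answer matches."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each reward centered by the group mean.
    If every rollout fails (all rewards 0), every advantage is 0 and
    the policy gradient vanishes -- training stalls."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A weak model that never solves the problem: every rollout is wrong.
rewards = [verifiable_reward(a, "42") for a in ["41", "40", "39", "38"]]
print(group_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0] -> no learning signal

# Adding one correct off-policy (teacher) trace to the same group:
mixed = group_advantages(rewards + [verifiable_reward("42", "42")])
print(mixed)  # failed rollouts get negative advantage, the teacher trace positive
```

This is exactly the situation in the Llama-3.1-8B example: on-policy groups stay all-zero, while a mixed group carries gradient information from the first step.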
Key Novelty
Mixed-Policy GRPO with Policy Shaping
  • Combines the model's own rollouts (on-policy) with correct reasoning traces from a stronger teacher model (off-policy) in the same group-based advantage computation, allowing the model to imitate when it fails and explore when it succeeds
  • Introduces 'policy shaping via regularized importance sampling,' which modifies the gradient weights to emphasize low-probability but correct actions from the teacher, preventing the model from lazily memorizing the teacher's style without understanding
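The shaping idea can be sketched numerically. Assuming a regularizer of the form f(x) = x / (x + γ), where x is the student's probability on a teacher token (γ here is an illustrative hyperparameter, not a value from the paper), the derivative γ / (x + γ)² is largest when x is small: low-probability but correct teacher tokens receive the strongest gradient, while tokens the student already predicts confidently contribute little, which discourages superficial style imitation.

```python
# Hedged sketch of policy shaping via a regularized importance weight.
# The functional form f(x) = x / (x + gamma) and gamma = 0.1 are
# assumptions for illustration.

def shaped_weight(prob: float, gamma: float = 0.1) -> float:
    """Regularized weight applied to an off-policy (teacher) token."""
    return prob / (prob + gamma)

def shaped_grad(prob: float, gamma: float = 0.1) -> float:
    """d/dx [x / (x + gamma)] = gamma / (x + gamma)^2.
    Largest when prob is small: rare-but-correct teacher tokens drive
    the update instead of tokens the student already predicts well."""
    return gamma / (prob + gamma) ** 2

# A token the student barely predicts vs. one it already predicts well:
print(shaped_grad(0.01), shaped_grad(0.9))  # low-probability token gets the larger gradient
```

Contrast this with an unshaped objective, whose per-token gradient scales with the probability itself and therefore keeps reinforcing what the student already does.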
Evaluation Highlights
  • +6.4 point average gain across six math benchmarks (including AIME and MATH-500) using Qwen2.5-Math-7B compared to previous RLVR methods
  • +6.2 point average improvement on out-of-distribution tasks (ARC-c, GPQA, MMLU-Pro), significantly outperforming the best baseline OpenReasoner-Zero (57.8 vs 51.6)
  • Successfully trains Llama-3.1-8B on hard tasks where standard On-Policy RL fails completely (0 reward)
Breakthrough Assessment
8/10
Addresses a fundamental limitation of the current RLVR paradigm (on-policy exploration bounds) with a theoretically grounded and empirically effective off-policy integration.