Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Shanghai AI Laboratory, Westlake University, Nanjing University, The Chinese University of Hong Kong
arXiv (2025)
RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Large Reasoning Models (LRMs) · Reinforcement Learning with Verifiable Rewards (RLVR)
LUFFY augments on-policy reinforcement learning with off-policy reasoning traces from stronger models, using mixed-policy updates and gradient shaping to push the learner past the boundary of what it can discover through its own exploration.
Core Problem
Standard on-policy RLVR (Reinforcement Learning with Verifiable Rewards) is constrained by the model's initial capabilities; if a model cannot spontaneously generate a correct reasoning chain, it cannot reinforce it, leading to failure in weak models.
Why it matters:
  • On-policy methods like those used in DeepSeek-R1 primarily amplify existing behaviors rather than teaching genuinely new reasoning skills
  • Weak foundation models (e.g., Llama-3.1-8B) often hit performance plateaus or fail completely (zero reward) on hard tasks because they lack the 'aha moments' needed to start the RL loop
Concrete Example: When training Llama-3.1-8B on a 'Hard' math subset, standard on-policy RL yields flat zero rewards because the model never generates a correct solution to learn from. In contrast, LUFFY uses off-policy traces to provide initial learning signals, successfully training the model.
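The failure mode above can be made concrete with a minimal sketch (illustrative only, not the paper's implementation): with a binary verifiable reward and GRPO-style group-centered advantages, a rollout group in which every sample is wrong yields zero advantage everywhere, so the policy gradient vanishes. Injecting a single correct off-policy trace into the group restores a nonzero signal.

```python
# Illustrative sketch: binary verifiable reward + group-centered advantages.
# All names here are hypothetical, not from the LUFFY codebase.

def verifiable_reward(answer: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 iff the final answer matches."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each reward centered by the group mean.
    If every rollout fails (all rewards 0), every advantage is 0 and
    the policy gradient vanishes -- training stalls."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A weak model that never solves the problem: every rollout is wrong.
rewards = [verifiable_reward(a, "42") for a in ["41", "40", "39", "38"]]
print(group_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0] -> no learning signal

# Adding one correct off-policy (teacher) trace to the same group:
mixed = group_advantages(rewards + [verifiable_reward("42", "42")])
print(mixed)  # failed rollouts get negative advantage, the teacher trace positive
```

This is exactly the situation in the Llama-3.1-8B example: on-policy groups stay all-zero, while a mixed group carries gradient information from the first step.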
Key Novelty
Mixed-Policy GRPO with Policy Shaping
  • Combines the model's own rollouts (on-policy) with correct reasoning traces from a stronger teacher model (off-policy) in the same group-based advantage computation, allowing the model to imitate when it fails and explore when it succeeds
  • Introduces 'policy shaping via regularized importance sampling,' which modifies the gradient weights to emphasize low-probability but correct actions from the teacher, preventing the model from lazily memorizing the teacher's style without understanding
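The shaping idea can be sketched numerically. Assuming a regularizer of the form f(x) = x / (x + γ), where x is the student's probability on a teacher token (γ here is an illustrative hyperparameter, not a value from the paper), the derivative γ / (x + γ)² is largest when x is small: low-probability but correct teacher tokens receive the strongest gradient, while tokens the student already predicts confidently contribute little, which discourages superficial style imitation.

```python
# Hedged sketch of policy shaping via a regularized importance weight.
# The functional form f(x) = x / (x + gamma) and gamma = 0.1 are
# assumptions for illustration.

def shaped_weight(prob: float, gamma: float = 0.1) -> float:
    """Regularized weight applied to an off-policy (teacher) token."""
    return prob / (prob + gamma)

def shaped_grad(prob: float, gamma: float = 0.1) -> float:
    """d/dx [x / (x + gamma)] = gamma / (x + gamma)^2.
    Largest when prob is small: rare-but-correct teacher tokens drive
    the update instead of tokens the student already predicts well."""
    return gamma / (prob + gamma) ** 2

# A token the student barely predicts vs. one it already predicts well:
print(shaped_grad(0.01), shaped_grad(0.9))  # low-probability token gets the larger gradient
```

Contrast this with an unshaped objective, whose per-token gradient scales with the probability itself and therefore keeps reinforcing what the student already does.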
Evaluation Highlights
  • +6.4 point average gain across six math benchmarks (including AIME and MATH-500) using Qwen2.5-Math-7B compared to previous RLVR methods
  • +6.2 point average improvement on out-of-distribution tasks (ARC-c, GPQA, MMLU-Pro), significantly outperforming the best baseline OpenReasoner-Zero (57.8 vs 51.6)
  • Successfully trains Llama-3.1-8B on hard tasks where standard On-Policy RL fails completely (0 reward)
Breakthrough Assessment
8/10
Addresses a fundamental limitation of the current RLVR paradigm (on-policy exploration bounds) with a theoretically grounded and empirically effective off-policy integration.