GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu
Huawei Research, Huawei Noah’s Ark Lab, City University of Hong Kong
arXiv.org (2025)
RL Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Large Language Model Reasoning · Mathematical Problem Solving
GHPO stabilizes reasoning model training by dynamically detecting difficult problems and switching from pure reinforcement learning to trace-guided imitation learning, preventing reward sparsity.
Core Problem
RLVR methods like GRPO suffer from 'capacity-difficulty mismatch,' where training data is too hard for the model's current capability, leading to zero-reward trajectories and stalled learning.
Why it matters:
  • Standard on-policy RL fails when the model cannot find a single correct solution, resulting in vanishing gradients and wasted computation
  • Smaller, on-device models (e.g., 7B parameters) are particularly vulnerable, failing on more than 50% of competition-level math problems even before training begins
  • Existing curriculum learning requires manual partitioning, and dynamic sampling (discarding hard data) is data-inefficient
Concrete Example: On the NuminaMath-1.5 dataset, a Qwen2.5-7B-Instruct model fails to solve 52% of problems. In standard GRPO, these problems yield a group of incorrect responses (all zero rewards), causing the advantage estimate to be zero and providing no learning signal.
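The failure mode above follows directly from GRPO's group-relative advantage. A minimal sketch (a simplified, assumed form of the normalization, not the paper's code) shows why an all-incorrect group produces exactly zero advantages and hence no gradient:

```python
def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A problem the model sometimes solves: mixed rewards give a
# non-zero learning signal for each response.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))

# A problem too hard for the model: every reward is zero, so
# mean and std are zero and every advantage is exactly 0.0.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

The rollouts on such problems still cost full inference compute, which is why GRPO wastes computation on prompts beyond the model's current capability.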
Key Novelty
Guided Hybrid Policy Optimization (GHPO)
  • Dynamically assesses problem difficulty on-the-fly during training rather than using static dataset partitions
  • Uses a hybrid strategy: applies standard exploration-based RL for manageable tasks, but seamlessly switches to imitation learning with partial solution traces for tasks where the model fails
  • Leverages 'partial ground truth' to steer the model towards correct answers on hard problems, creating valid gradient signals where they would otherwise be zero
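The hybrid strategy can be sketched as a per-prompt control loop. This is a hypothetical illustration, not the paper's implementation: the helper names (`sample_group`, `verify`) and the fixed hint fractions are assumptions, and the escalation schedule is one plausible way to inject progressively longer prefixes of the reference trace.

```python
def ghpo_step(prompt, solution_trace, sample_group, verify,
              hint_fracs=(0.25, 0.5, 1.0)):
    """One GHPO-style step: try pure on-policy RL first,
    then escalate guidance on detected-hard problems."""
    responses = sample_group(prompt)            # standard rollouts
    if any(verify(r) for r in responses):
        return prompt, responses                # manageable task: plain RL

    # On-the-fly difficulty detection: all rollouts failed, so prepend
    # progressively longer prefixes of the ground-truth trace as a hint.
    for frac in hint_fracs:
        hint = solution_trace[: int(len(solution_trace) * frac)]
        guided_prompt = prompt + "\n" + hint
        responses = sample_group(guided_prompt)
        if any(verify(r) for r in responses):
            break                               # guided rollouts now give signal
    return guided_prompt, responses


# Toy demo: a "model" that only succeeds once the hint covers step 2.
trace = "step1 step2 step3 step4"
def sample_group(p):
    return ["ok" if "step2" in p else "fail"] * 4
def verify(r):
    return r == "ok"

p, rs = ghpo_step("problem", trace, sample_group, verify)
print("step2" in p)  # True: guidance escalated until rollouts succeed
```

The key property is that a prompt the model fails outright still yields verified-correct trajectories for the policy update, instead of being discarded as in dynamic sampling.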
Evaluation Highlights
  • Achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks (claimed in abstract)
  • Outperforms strong on-policy reinforcement learning (GRPO) and curriculum learning baselines (claimed in abstract)
  • Significantly enhances training stability and sample efficiency compared to standard on-policy methods
Breakthrough Assessment
7/10
Addresses a critical bottleneck in RLVR (reward sparsity) with a logical hybrid approach. While the core idea of 'guiding' is known, the dynamic, adaptive integration into the GRPO loop is a practical advancement for reasoning models.