
Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
arXiv.org (2026)
RL Reasoning Benchmark

📝 Paper Summary

Mathematical Reasoning · Reinforcement Learning with Verifiable Rewards (RLVR) · Latent Space Geometry
MRPO expands LLM reasoning capabilities by ejecting the policy into the null space of the pre-training bias manifold via orthogonal exploration, then stabilizing it with a spectral rank-aware reward.
Core Problem
Standard RL alignment methods (like PPO/DPO) constrain models to a low-rank 'Bias Manifold' of pre-existing stylistic norms, effectively placing a ceiling on reasoning capacity by inhibiting the exploration of complex, high-dimensional solution paths.
Why it matters:
  • Current alignment acts as a 'tax,' suppressing latent reasoning capacity in exchange for safety and training convergence
  • Pure RL (like DeepSeek-R1) allows expansion but suffers from stability issues like reward hacking and language mixing due to lack of geometric constraints
  • The 'Superficial Alignment Hypothesis' suggests standard methods only elicit pre-existing capabilities rather than injecting new ones
Concrete Example: A standard RL-aligned model often answers a complex math problem using a safe, memorized heuristic (low effective rank) that looks fluent but fails on novel variations. In contrast, MRPO forces the model to explore orthogonal 'null space' trajectories, discovering a high-dimensional, first-principles derivation that standard greedy search would never sample.
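The 'null space' idea can be illustrated with a standard orthogonal projection. This is a generic sketch, not the paper's implementation: the basis `U` for the bias manifold and the update vector `g` are hypothetical stand-ins for whatever subspace and policy-gradient direction MRPO actually estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 16, 3  # hidden dimension, assumed rank of the bias manifold

# Orthonormal basis U (d x k) for a hypothetical low-rank bias manifold
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Projector onto the orthogonal complement (null space) of span(U)
P_null = np.eye(d) - U @ U.T

g = rng.standard_normal(d)   # e.g. a candidate policy-update direction
g_null = P_null @ g          # component orthogonal to the bias manifold

# The projected update has no component along any bias direction
assert np.allclose(U.T @ g_null, 0.0, atol=1e-10)
```

Exploring along `g_null` rather than `g` is what "ejects" the policy off the manifold: every retained direction is, by construction, one the bias subspace cannot express.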
Key Novelty
Manifold-Reshaping Policy Optimization (MRPO)
  • Geometrically 'ejects' the model from its pre-trained bias manifold using a Student-Guides-Teacher cold-start, where a weaker model helps probe the teacher's null space for novel trajectories
  • Integrates an Effective Rank spectral reward into the RL objective, mathematically penalizing the natural tendency of RL policies to collapse into low-entropy, repetitive reasoning patterns
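The Effective Rank quantity behind the spectral reward can be sketched with the standard entropy-of-singular-values definition (Roy & Vetterli, 2007); the paper's exact reward shaping and the choice of which activations to measure are not reproduced here, and the matrices below are synthetic stand-ins for hidden-state samples.

```python
import numpy as np

def effective_rank(H: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank via the entropy of the normalized singular-value
    distribution:  erank(H) = exp(-sum_i p_i log p_i),  p_i = s_i / sum_j s_j.
    """
    s = np.linalg.svd(H, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
# Low-rank states (collapsed, repetitive reasoning) vs. diverse full-rank states
collapsed = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
diverse = rng.standard_normal((64, 32))
assert effective_rank(collapsed) < effective_rank(diverse)
```

Rewarding a higher effective rank directly penalizes the collapse into a few dominant singular directions, which is the low-entropy failure mode the bullet above describes.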
Evaluation Highlights
  • 56.7% accuracy on AIME 2024 with a 4B parameter model, outperforming the significantly larger Qwen3-32B (33.3%) by 23.4 percentage points
  • 84.2% accuracy on MATH-500, surpassing state-of-the-art 14B models like Qwen2.5-14B-SimpleRL despite having roughly a third as many parameters
  • Achieves 49.8% mean accuracy across five math benchmarks, improving over the standard GRPO baseline (46.0%) while maintaining comparable token costs
Breakthrough Assessment
9/10
Offers a fundamental geometric explanation for the 'alignment tax' and provides a concrete, mathematically grounded solution (Effective Rank regularization) that allows small models to beat much larger ones on reasoning tasks.