
Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
Tencent AI Lab in Bellevue, WA, USA
arXiv (2026)
RL Reasoning

📝 Paper Summary

LLM Post-training Reinforcement Learning for Reasoning
The paper replaces uniform training in LLM reasoning with two adversarial controllers that dynamically upweight hard prompt groups and reallocate rollout budgets to maximize learning on the difficulty frontier.
Core Problem
Standard RL pipelines for reasoning assume static uniformity: they sample prompts uniformly and assign fixed rollout budgets, which wastes compute on easy problems the model has already solved while under-training the long tail of hard problems.
Why it matters:
  • Reasoning datasets are heavy-tailed; uniform sampling causes models to over-optimize the 'easy core' while neglecting difficult edge cases
  • Fixed rollout budgets fail to capture that 'frontier' prompts require massive exploration to reduce gradient variance, while solved prompts yield low-variance signals
  • Existing methods lack a mechanism to automatically 'steer' training toward the evolving difficulty landscape of the model
Concrete Example: In a math dataset containing both elementary algebra and Olympiad number theory, uniform sampling lets the frequent algebra problems the model has already mastered dominate gradient updates. Meanwhile, the Olympiad problems (which require many rollouts to find a correct path) are sampled rarely and given insufficient budget, stalling progress.
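The difficulty signal described above can be estimated from rollout outcomes. A minimal sketch, using the standard unbiased pass@k estimator and a simple bucketing scheme (the function names, number of groups, and bucketing rule are illustrative assumptions, not the paper's exact grouping):

```python
def pass_at_k(num_correct: int, n: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n rollouts, num_correct
    of which are correct, is correct."""
    if n - num_correct < k:
        return 1.0  # too few wrong rollouts to fill k misses
    prob_all_wrong = 1.0
    for i in range(k):
        prob_all_wrong *= (n - num_correct - i) / (n - i)
    return 1.0 - prob_all_wrong

def difficulty_group(p_at_k: float, num_groups: int = 4) -> int:
    """Bucket a prompt into a difficulty group by its pass@k.
    Group 0 = hardest (pass@k near 0), last group = easiest."""
    return min(int(p_at_k * num_groups), num_groups - 1)
```

Because pass@k is re-estimated from fresh rollouts each round, group membership can track the model's evolving difficulty frontier rather than a fixed dataset label.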
Key Novelty
Multi-Adversary Group Distributionally Robust Optimization (GDRO)
  • Prompt-GDRO: An adversary that dynamically reweights training data based on real-time difficulty (pass@k), forcing the model to focus on groups with high per-group loss (i.e., high difficulty) rather than high frequency.
  • Rollout-GDRO: A second adversary that redistributes the rollout budget across groups (e.g., fewer for easy, more for hard) to maximize gradient variance reduction while maintaining a fixed global compute budget.
Evaluation Highlights
  • +13.13% relative gain in pass@8 accuracy on the DAPO 14.1k dataset with Qwen3-4B-Base using Prompt-GDRO, compared to a GRPO baseline.
  • +10.64% relative gain in pass@8 accuracy with Qwen3-1.7B-Base using Rollout-GDRO, compared to a GRPO baseline.
  • Consistently outperforms GRPO across 1.7B, 4B, and 8B scales, demonstrating scalability of the adversarial framework.
Breakthrough Assessment
8/10
Significant methodology improvement by introducing dynamic, data-agnostic difficulty grouping and adversarial resource allocation to RLHF. Strong empirical gains on reasoning benchmarks.