Learning Robust Reasoning through Guided Adversarial Self-Play

Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Lizhang Chen, Amy Zhang, Liu Leqi
University of Texas at Austin, Honda Research Institute
arXiv (2026)
RL · Reasoning · Agent

📝 Paper Summary

Topics: Robustness in Reasoning Models · Reinforcement Learning from Verifiable Rewards (RLVR) · Adversarial Self-Play
GASP trains reasoning models to detect and repair errors in their own chain-of-thought by using an adversarial polluter to inject corruptions and an in-distribution guidance term to stabilize learning.
Core Problem
Strong reasoning models trained with RLVR are brittle: they optimize for final-answer correctness assuming clean context but fail catastrophically when conditioned on fallible context (e.g., corrupted partial solutions or distracting prompts).
Why it matters:
  • Real-world deployments often involve noisy inputs, collaborative reasoning with imperfect agents, or partial solution traces that may contain errors
  • Current models exhibit 'inverse scaling' on recoverability tests—stronger models are often more likely to blindly follow a corrupted step rather than correct it
  • Existing RLVR methods do not explicitly train the capability to distrust context, diagnose inconsistencies, or repair trajectories
Concrete Example: When a math model is given a chain-of-thought that makes a subtle calculation error halfway through, it often continues the reasoning based on the error rather than correcting it, even if it knows how to solve the problem correctly from scratch.
Key Novelty
Guided Adversarial Self-Play (GASP)
  • Adversarial Self-Play: A single model plays two roles—a 'polluter' that learns to inject subtle, failure-inducing corruptions into reasoning traces, and an 'agent' that learns to diagnose and fix them
  • In-Distribution Repair Guidance: Addresses the scarcity of successful repairs by cloning self-generated repair snippets (which are high-likelihood under the current policy) rather than off-distribution teacher fixes
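To make the polluter/agent interaction concrete, here is a toy, deterministic sketch of the two roles on arithmetic chains of thought. This is not the paper's training loop (GASP learns both roles with RL from verifiable rewards); it only instantiates the environment the roles operate over. All function names (`make_trace`, `pollute`, `diagnose`, `repair`) are hypothetical illustrations, not the paper's API.

```python
import random

# supported single-step operations in the toy chain of thought
OPS = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

def make_trace(a, b, c):
    """Clean chain of thought for (a + b) * c, as (op, x, y, result) steps."""
    s1 = a + b
    return [("add", a, b, s1), ("mul", s1, c, s1 * c)]

def pollute(trace, rng):
    """Polluter role: corrupt one step's result by +/-1, then let every
    downstream step blindly consume the corrupted value (the failure mode
    the paper targets). Returns the corrupted trace and the corrupted index."""
    i = rng.randrange(len(trace))
    op, x, y, r = trace[i]
    bad = list(trace)
    bad[i] = (op, x, y, r + rng.choice([-1, 1]))
    for j in range(i + 1, len(bad)):
        op_j, _, y_j, _ = bad[j]
        x_j = bad[j - 1][3]  # consume the (corrupted) upstream result
        bad[j] = (op_j, x_j, y_j, OPS[op_j](x_j, y_j))
    return bad, i

def diagnose(trace):
    """Agent role, part 1: index of the first arithmetically wrong step,
    or -1 if the trace is internally consistent."""
    for i, (op, x, y, r) in enumerate(trace):
        if OPS[op](x, y) != r:
            return i
    return -1

def repair(trace):
    """Agent role, part 2: recompute from the first wrong step onward."""
    i = diagnose(trace)
    if i == -1:
        return trace
    fixed = list(trace)
    for j in range(i, len(fixed)):
        op, x, y, _ = fixed[j]
        if j > i:
            x = fixed[j - 1][3]
        fixed[j] = (op, x, y, OPS[op](x, y))
    return fixed
```

In GASP, both roles are played by one policy and trained adversarially; successful `repair`-style snippets the policy itself produces are then cloned as the in-distribution guidance signal, rather than imitating fixes from an external teacher.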
Evaluation Highlights
  • Improves recoverability on GSM8K by +25-30% across multiple model sizes (1.5B to 8B) compared to RLVR baselines
  • Boosts diagnosability (identifying the first error step) on MR-GSM8K by over +40% compared to standard RLVR
  • Increases reliability under input perturbations (RUPBench) by +10-15% while often slightly improving clean accuracy
Breakthrough Assessment
8/10
Significantly improves robustness against internal and external errors without human labels or external teachers. The method elegantly solves the sparse reward problem in self-correction training.