RLVR: Reinforcement Learning from Verifiable Rewards—optimizing models based on whether the final answer is correct, without human labels for intermediate steps
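A minimal sketch of a verifiable reward, assuming exact string match as the verifier (real verifiers often use math-equivalence checkers or unit tests); the function name is illustrative, not from the source:

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Reward depends only on final-answer correctness; intermediate
    reasoning steps are never labeled or scored."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

# Whitespace differences are ignored; anything else scores zero.
print(verifiable_reward("42", "42 "))  # → 1.0
print(verifiable_reward("41", "42"))  # → 0.0
```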
GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes advantages within a group of samples for the same input to reduce variance
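The group-relative normalization can be sketched as follows, assuming binary correctness rewards for several samples drawn from the same prompt (the helper name is illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward
    against the mean and std of its own group (same input prompt),
    so the policy gradient compares samples to each other rather
    than to an absolute baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt; two correct, two incorrect:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because advantages are centered within the group, a uniformly easy or uniformly hard prompt contributes no gradient signal, which is the variance-reduction effect the definition refers to.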
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer
polluter: A role played by the model during training in which it generates corrupted reasoning steps intended to mislead the agent
agent: The primary role of the model where it attempts to solve the problem or recover from corrupted context
recoverability: The ability of a model to produce a correct final answer despite starting with a partially incorrect reasoning trace
diagnosability: The ability to identify the specific step where a reasoning trace went wrong
inverse scaling: A phenomenon in which larger or more capable models perform worse on a specific metric (here, susceptibility to following errors in corrupted reasoning traces)
in-distribution guidance: A training objective that encourages the model to imitate its own successful repair trajectories, ensuring updates remain close to the model's current capabilities
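One way to realize this objective is to filter the model's own rollouts for verified-successful repairs and use only those as imitation targets. The sketch below assumes a hypothetical record format with `corrupted_prefix`, `repair`, and `correct` fields; none of these names come from the source:

```python
def in_distribution_guidance_batch(trajectories):
    """Keep only the model's own successful repair trajectories as
    imitation targets, so the training signal stays close to what
    the current policy can already produce."""
    return [
        (t["corrupted_prefix"], t["repair"])
        for t in trajectories
        if t["correct"]  # self-generated AND verified correct
    ]

rollouts = [
    {"corrupted_prefix": "2+2=5, so", "repair": "wait, 2+2=4, so", "correct": True},
    {"corrupted_prefix": "2+2=5, so", "repair": "then 5*2=10",     "correct": False},
]
# Only the verified repair survives as an imitation target.
print(in_distribution_guidance_batch(rollouts))
```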