RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

📝 Paper Summary

LLM Reasoning · Reinforcement Learning (RL) Post-training

Tango improves LLM reasoning by jointly training a generator and a generative verifier via interleaved reinforcement learning, allowing the verifier to learn step-level critique strategies solely from outcome signals.

Core Problem

Current RL post-training methods rely on fixed or supervised-fine-tuned (SFT) verifiers, which cannot adapt to the generator's evolving capabilities and are vulnerable to reward hacking.

Why it matters:

Fixed verifiers (rule-based or frozen) cap the generator's attainable performance and fail to generalize to new reasoning paths
SFT-trained verifiers are imitation-based, lacking the exploration needed to robustly distinguish correct reasoning from plausible-sounding errors
When a generator improves, a static verifier becomes the bottleneck, unable to provide meaningful feedback on increasingly complex trajectories

Concrete Example: A generator might produce a correct final answer using flawed logic (a false positive). A fixed verifier trained on static data might miss this subtle logical error. Tango's verifier, evolving alongside the generator, learns to penalize such 'lucky guesses' because it is optimized to align its step-judgments with true correctness over time.

Key Novelty

Generative Co-Evolutionary RL (Tango)

Treats the Verifier as a policy trained via RL, not just a fixed regression model, allowing it to explore and refine its grading logic
Interleaves training: the Generator improves using Verifier feedback, and the Verifier improves by learning to better predict the correctness of the Generator's new outputs
Eliminates the need for expensive step-level human annotations; the Verifier learns 'what makes a step correct' purely from the final answer's correctness signal

Architecture

The overall Tango framework illustrating the interleaved training loop between the Generator and Verifier.

Evaluation Highlights

Achieves an average relative improvement of 25.5% across five competition-level math benchmarks compared to vanilla RL (GRPO) baselines
Doubles accuracy on the AIME 2025 benchmark relative to vanilla GRPO, demonstrating effectiveness on the hardest tasks
Verifier establishes a new state-of-the-art on ProcessBench, outperforming the much larger Qwen2.5-Math-72B-Instruct despite being a 7B model trained without process labels

Breakthrough Assessment

9/10

Proposes a fundamental shift from fixed/SFT verifiers to RL-trained generative verifiers. The ability to learn process-level verification from outcome-only signals via co-evolution is a significant methodological advance.

⚙️ Technical Details

Problem Definition

Setting: Joint optimization of a generator policy and a verifier policy using reinforcement learning with only outcome-level supervision

Inputs: Natural language question x

Outputs: Multi-step reasoning chain o_g and a verification response o_v

Pipeline Flow

Group: Generation & Verification → Generator produces outputs; Verifier critiques them
Group: Reward Computation → Outcomes compared to ground truth; Steps scored by Verifier judgments
Group: Optimization → RL updates applied to both policies
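
The three groups above repeat in an alternating schedule. The following is a minimal sketch of that schedule only, not the authors' implementation: the function name, the callback arguments, and the step counts `gen_steps` and `ver_steps` are hypothetical, and each callback stands in for a full rollout, reward computation, and GRPO update on one policy while the other is held fixed.

```python
from typing import Callable, List


def interleaved_training(
    num_rounds: int,
    gen_steps: int,
    ver_steps: int,
    update_generator: Callable[[], None],
    update_verifier: Callable[[], None],
) -> List[str]:
    """Alternate generator and verifier updates and return the update order.

    All argument names are illustrative assumptions; the summary does not
    state how many steps each policy takes per round.
    """
    schedule: List[str] = []
    for _ in range(num_rounds):
        # Phase 1: the generator learns from the (temporarily fixed) verifier.
        for _ in range(gen_steps):
            update_generator()
            schedule.append("G")
        # Phase 2: the verifier learns to judge the updated generator's outputs.
        for _ in range(ver_steps):
            update_verifier()
            schedule.append("V")
    return schedule
```

With `num_rounds=2`, `gen_steps=2`, and `ver_steps=1`, the update order is G, G, V, G, G, V.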

System Modules

Generator (Generation & Verification)

Policy model that generates step-by-step reasoning trajectories

Model or implementation: initialized from Qwen2.5-7B

Verifier (Generation & Verification)

Policy model that generates natural language critiques and step-wise correctness tags

Model or implementation: initialized from Qwen2.5-7B

Novel Architectural Elements

Generative Verifier Architecture: The verifier is a full LLM outputting text-based critiques, not a scalar value head attached to a frozen trunk
Co-Evolutionary Loop: The system explicitly connects the Generator and Verifier in a dual-RL update cycle where the Verifier's training data is dynamically generated by the Generator

Modeling

Base Model: Qwen2.5-7B

Training Method: Interleaved Reinforcement Learning (specifically GRPO)

Objective Functions:

Purpose: Optimize Generator policy.

Formally: Maximize a weighted sum of the Outcome Advantage (from ground truth) and the Step Advantage (from Verifier judgments)

Purpose: Optimize Verifier policy.

Formally: Maximize alignment between the Verifier's final judgment and ground-truth correctness, plus format compliance rewards

Purpose: Balance Verifier training on imbalanced data.

Formally: Class-aware reweighting (s+, s-) to prevent the Verifier from collapsing to always predicting 'Incorrect' early in training
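
The first two objectives can be illustrated with a short sketch. It rests on assumptions: the summary says only "weighted sum" and "alignment plus format compliance", so the single weight `alpha`, the 1.0/0.0 match reward, and the `format_bonus` value are hypothetical choices, as are the function names.

```python
from typing import List


def generator_advantages(
    outcome_adv: float, step_advs: List[float], alpha: float
) -> List[float]:
    """Combine one trajectory-level outcome advantage with per-step advantages.

    Assumed form: outcome_adv + alpha * step_adv for every step.
    """
    return [outcome_adv + alpha * s for s in step_advs]


def verifier_reward(
    predicted_correct: bool,
    actually_correct: bool,
    well_formatted: bool,
    format_bonus: float = 0.1,  # hypothetical value
) -> float:
    """Reward the verifier when its final judgment matches the ground truth.

    A small bonus is added when the response follows the expected format.
    """
    match = 1.0 if predicted_correct == actually_correct else 0.0
    return match + (format_bonus if well_formatted else 0.0)
```

Note that the verifier's reward depends only on its final judgment and the ground-truth outcome, which is why no step-level labels are needed.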

Key Hyperparameters:

rl_algorithm: GRPO
alpha_decay: Exponential decay for step-level reward weight
verifier_reweighting: Inverse square root of class counts (Correct vs Incorrect)
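
As a rough illustration of the last two hyperparameters, the sketch below shows one possible exponential decay schedule and one possible inverse-square-root class weighting. The initial weight `alpha0`, the `decay_rate`, the function names, and the mapping of the two weights onto s+ and s- are assumptions; the summary does not give the actual values or formulas.

```python
import math
from typing import Dict


def alpha_at(step: int, alpha0: float = 0.5, decay_rate: float = 0.01) -> float:
    """Exponentially decayed weight on the step-level reward (hypothetical values)."""
    return alpha0 * math.exp(-decay_rate * step)


def class_weights(n_correct: int, n_incorrect: int) -> Dict[str, float]:
    """Inverse-square-root class weights; an empty class gets weight 0.0.

    The rarer class receives the larger weight, which counteracts the pull
    toward always predicting the majority label.
    """
    def weight(n: int) -> float:
        return 1.0 / math.sqrt(n) if n > 0 else 0.0

    return {"correct": weight(n_correct), "incorrect": weight(n_incorrect)}
```

For example, a batch with 4 correct and 16 incorrect generator outputs yields weights of 0.5 and 0.25, so each correct sample counts twice as much as each incorrect one.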

Compute: Not reported in the paper

Comparison to Prior Work

vs. PRIME: PRIME uses a discriminative, logit-based PRM trained via SFT. Tango uses a generative PRM trained via RL, offering better generalization and robustness.
vs. DeepSeek-Math-GRPO: DeepSeek uses a static verifier. Tango's verifier co-evolves, preventing the generator from outsmarting a fixed reward model.
vs. Outcome-Only RL: Tango provides dense step-level signals inferred by the verifier, rather than sparse final rewards, accelerating learning.

Limitations

Computational cost of generative verification is higher than discriminative scalar heads (requires token generation)
Requires careful hyperparameter tuning (e.g., alpha decay) to balance step vs. outcome rewards
Training stability can be sensitive to the initial imbalance of correct/incorrect solutions

Reproducibility

Code: https://github.com/kaiwenzha/rl-tango

The verifier is initialized from Qwen2.5-7B. Step-level annotations are not required for training.

📊 Experiments & Results

Evaluation Setup

Mathematical and Out-of-Domain Reasoning Tasks

Benchmarks:

AIME 2024 / 2025 (Competition Math)
AMC 2023 (Competition Math)
MATH (Challenging Math Problems)
ProcessBench (Process-level Verification)

Metrics:

Accuracy (Pass@1)
Step-level Verification Accuracy

Statistical methodology: Not explicitly reported in the paper

Key Results

Absolute accuracy values are not available in this summary. The relative gains reported are those listed under Evaluation Highlights: a 25.5% average relative improvement over vanilla GRPO across five competition-level math benchmarks, and a doubling of accuracy on AIME 2025.

Experiment Figures

Comparison of Tango against baselines across varying task difficulties.

Main Takeaways

Tango consistently outperforms vanilla RL (outcome-only) and SFT-based co-training (PRIME) across multiple benchmarks, validating the benefit of RL-trained verifiers.
The method is particularly effective on hard tasks: gains are highest on AIME 2025 (doubling accuracy vs. vanilla GRPO), suggesting the verifier helps navigate complex search spaces.
The verifier learns effective step-level grading solely from outcome signals, achieving SOTA on ProcessBench without ever seeing a human-annotated step label.
Class-aware reweighting is critical for verifier training to prevent collapse when the generator initially produces mostly incorrect answers.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Large Language Models (LLMs)
Chain-of-Thought (CoT) Reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance

PRM: Process Reward Model—a verifier that scores each step of a reasoning chain rather than just the final answer

ORM: Outcome Reward Model—a verifier that scores only the final result of a reasoning chain

SFT: Supervised Fine-Tuning—training a model to mimic labeled examples

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps

Reward Hacking: When an RL agent exploits flaws in the reward function to get high scores without actually solving the task correctly

Generative Verifier: A verifier that outputs a natural language critique and labels (as tokens) rather than just a numerical score

Interleaved Training: Alternating training phases between two models (Generator and Verifier) so they adapt to each other
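
The group normalization described in the GRPO entry above can be sketched in a few lines. This is a simplified illustration rather than a full GRPO implementation: it covers only the advantage computation, uses the population standard deviation as one possible choice, and the `eps` term is an assumed safeguard against division by zero.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards of several outputs sampled for the same prompt.

    Each output's advantage is its reward minus the group mean, divided by
    the group standard deviation (plus eps for numerical safety).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every output in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal.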