RLVR: Reinforcement Learning with Verifiable Rewards—using outcome-based feedback (e.g., code compiles and runs) rather than human labels
Abduction: Reasoning mode—inferring a plausible input given a program and its output (trial-and-error search)
Deduction: Reasoning mode—predicting the output given a program and an input (step-by-step execution)
Induction: Reasoning mode—synthesizing a program given a set of input-output examples (generalization)
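The three reasoning modes above can be illustrated with a toy Python program (the program and values here are hypothetical, chosen only to make the three questions concrete):

```python
# A toy program used to illustrate the three reasoning modes.
# (Illustrative example only; program and values are not from the paper.)

def f(x):
    return x * 2 + 1

# Deduction: given the program f and the input 3, predict the output
# by step-by-step execution.
output = f(3)  # -> 7

# Abduction: given the program f and the output 7, infer a plausible
# input by trial-and-error search over candidates.
candidate = next(x for x in range(-100, 100) if f(x) == 7)  # -> 3

# Induction: given input-output examples [(0, 1), (3, 7)], synthesize a
# program consistent with them (generalization), e.g.:
g = lambda x: 2 * x + 1
assert all(g(i) == o for i, o in [(0, 1), (3, 7)])
```

The same snippet thus yields three distinct tasks, depending on which of program, input, and output is held out.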
SFT: Supervised Fine-Tuning—training on labeled examples (not used here for the 'zero' paradigm)
TRR++: Task-Relative REINFORCE++—a newly proposed advantage estimator that normalizes baselines per task-role configuration
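A minimal sketch of the task-relative idea, assuming a running reward buffer per (task type, role) configuration; the class and method names below are mine, not the paper's, and the exact estimator may differ:

```python
from collections import defaultdict

class TaskRelativeBaseline:
    """Illustrative sketch: keep separate reward statistics for each
    (task_type, role) configuration and normalize each reward against
    its own configuration's mean and standard deviation."""

    def __init__(self, eps=1e-8):
        self.rewards = defaultdict(list)  # (task_type, role) -> rewards seen
        self.eps = eps

    def advantage(self, task_type, role, reward):
        key = (task_type, role)
        self.rewards[key].append(reward)
        rs = self.rewards[key]
        mean = sum(rs) / len(rs)
        var = sum((r - mean) ** 2 for r in rs) / len(rs)
        # Normalized advantage relative to this configuration only.
        return (reward - mean) / (var ** 0.5 + self.eps)
```

With three task types (abduction, deduction, induction) and two roles (proposer, solver), this maintains six independent baselines, so a high-variance role cannot distort another configuration's advantages.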
Learnability Reward: A reward signal for the Proposer that peaks when the Solver has a moderate success rate (neither 0% nor 100%), encouraging an appropriate curriculum
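One possible shape for such a reward, sketched with an illustrative peaked function, 4p(1 - p); this is an assumption for illustration, not the paper's exact formula:

```python
def learnability_reward(solver_success_rate: float) -> float:
    """Illustrative proposer reward (hypothetical shape, not the paper's
    formula): zero when the solver always fails or always succeeds, and
    maximal at a 50% success rate, steering the proposer toward tasks of
    intermediate difficulty."""
    p = solver_success_rate
    if p <= 0.0 or p >= 1.0:
        return 0.0          # nothing to learn from trivial or impossible tasks
    return 4.0 * p * (1.0 - p)  # peaks at p = 0.5
```

Any function that vanishes at the endpoints and peaks in between would serve the same curriculum-shaping purpose.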
Uh-oh moment: A safety observation in which the model produces concerning chains of thought during self-play