Reinforcement Learning with Verifiable Rewards (RLVR) · Compositional Generalization · Chain of Thought (CoT) Reasoning
RLVR can efficiently learn correct compositional reasoning from outcome feedback alone only when correct intermediate steps carry a statistical "task advantage": a higher probability of eventual verification success than incorrect steps.
Core Problem
RLVR provides global feedback (pass/fail) for a sequence of decisions, making it ambiguous which intermediate steps were responsible for success or failure.
Why it matters:
Models may converge to suboptimal "shortcuts" or fail to learn correct reasoning chains even with infinite data if the feedback signal doesn't propagate correctly
It is unknown why RLVR succeeds on some reasoning tasks but fails or degrades performance on others
Understanding the theoretical limits of outcome-based supervision is critical as supervision of intermediate steps is often costly or unavailable
Concrete Example: In a multi-step math problem, a model might use an incorrect formula but still arrive at the correct final answer through luck or cancellation of errors. If this 'lucky' path passes the verifier as often as the rigorous path (low task-advantage ratio), RLVR cannot distinguish between them and may reinforce the flaw.
Key Novelty
Task-Advantage Ratio
Introduces a theoretical quantity called the 'task-advantage ratio': the ratio of the probability of eventual verification success when a specific task is selected to that probability when it is not
Proves that the gradient update direction for any intermediate step is strictly governed by this ratio
Establishes that efficient learning requires this ratio to be favorable; otherwise, the model faces exponential complexity or suboptimal convergence
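The task-advantage ratio can be illustrated with a minimal Monte Carlo sketch. The `policy` and `verify` functions below are hypothetical stand-ins for the paper's abstract sampling policy and black-box verifier; only the ratio itself comes from the source.

```python
import random

def task_advantage_ratio(policy, verify, task, n=10_000, seed=0):
    """Monte Carlo estimate of P(V=1 | task selected) / P(V=1 | task not selected).

    `policy(rng)` samples a full task sequence and `verify(seq, rng)` returns
    a binary outcome; both are illustrative stand-ins, not the paper's model.
    """
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # selected? -> [successes, trials]
    for _ in range(n):
        seq = policy(rng)
        selected = task in seq
        counts[selected][1] += 1
        counts[selected][0] += verify(seq, rng)
    p_with = counts[True][0] / max(counts[True][1], 1)
    p_without = counts[False][0] / max(counts[False][1], 1)
    return p_with / max(p_without, 1e-12)
```

A ratio well above 1 means selecting the task measurably raises the chance of passing the verifier, which is exactly the signal the theory requires.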
Architecture
Conceptual illustration of decomposing a complex problem (adding two large numbers) into a sequence of simpler autoregressive tasks.
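The addition example can be made concrete: a small sketch (my own illustration, not the paper's code) that decomposes multi-digit addition into a chain of single-digit tasks, each consuming one digit pair plus the running carry.

```python
def add_digitwise(a: str, b: str):
    """Decompose multi-digit addition into a sequence of single-digit tasks.

    Each loop iteration is one simple "task" in the autoregressive chain:
    it reads a digit pair and the carry, and emits one output digit.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    steps, carry, out = [], 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        steps.append((da, db, carry, digit))  # one task per digit position
        out.append(str(digit))
    if carry:
        out.append("1")
    return "".join(reversed(out)), steps
```

Each tuple in `steps` plays the role of an intermediate CoT step; the verifier would only ever see the final string.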
Evaluation Highlights
Proved that when the task-advantage ratio condition holds, RLVR converges to the correct composition in O(S²) iterations (quadratic in Chain of Thought length)
Demonstrated theoretically that without structural advantage, RLVR may converge to suboptimal compositions even without representational barriers
Identified that poor base model quality can provably prevent learning by lowering the task-advantage ratio, explaining why weak models fail to improve with RLVR
Breakthrough Assessment
8/10
Provides a fundamental theoretical grounding for RLVR, explaining both its successes and failures through a single structural property. While theoretical, it offers essential insights into the limits of outcome-supervised reasoning.
⚙️ Technical Details
Problem Definition
Setting: Autoregressive generation of task sequences trained via REINFORCE on positive outcomes
Inputs: Initial prompt x₀ sampled from distribution D
Outputs: Sequence of tokens/tasks leading to a final output x_S that satisfies a binary verifier V(x_S)
Pipeline Flow
Input Processing: Prompt x₀ → Model f_θ
Reasoning: Step-by-step selection of Tasks σ_j
Verification: Final output x_S → Verifier V
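The three pipeline stages can be sketched as one episode loop. The callables are hypothetical placeholders for the paper's abstract policy, deterministic tasks, and black-box verifier.

```python
import random

def rlvr_episode(select_task, apply_task, verify, x0, S, rng):
    """One pass through the pipeline: prompt x0 -> task chain -> verifier.

    select_task, apply_task, and verify are illustrative stand-ins; the
    only feedback produced is the binary verifier outcome on x_S.
    """
    x, chosen = x0, []
    for _ in range(S):
        sigma = select_task(x, rng)   # step-by-step task selection
        chosen.append(sigma)
        x = apply_task(sigma, x)      # deterministic task application
    return chosen, verify(x)          # outcome feedback only
```

Note that `chosen` (the intermediate steps) is never graded directly, which is precisely what makes credit assignment ambiguous.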
System Modules
Task Selection Policy
Selects the next deterministic task to apply based on the current prefix
Model or implementation: Linear combination of task features (Abstracted LLM)
Verifier
Evaluates the final output for correctness
Model or implementation: Black-box function V
Novel Architectural Elements
Modeling the reasoning process explicitly as a sequence of discrete task selections with orthonormal positional embeddings for theoretical analysis
Modeling
Base Model: Abstracted autoregressive model with linear task features
Training Method: REINFORCE (Policy Gradient)
Objective Functions:
Purpose: Maximize the expected probability of generating a sequence that passes the verifier.
Formally: Update θ in the direction of ∇_θ log P_θ(x) for successful samples.
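A minimal sketch of this update for a tabular softmax task-selection policy (an assumption for illustration; the paper uses an abstracted linear model over task features):

```python
import math

def reinforce_step(theta, trajectory, reward, lr):
    """One REINFORCE update: theta += lr * reward * grad log P_theta(x).

    theta[j][t] is the logit for task t at step j. With reward in {0, 1},
    only verifier-passing samples change the parameters.
    """
    for j, t in enumerate(trajectory):
        logits = theta[j]
        z = sum(math.exp(v) for v in logits.values())
        for k in logits:
            p_k = math.exp(logits[k]) / z
            # d/d theta_k of log softmax(t) = 1[k == t] - p_k
            logits[k] += lr * reward * ((k == t) - p_k)
```

The score-function form `1[k == t] - p_k` is the standard softmax log-likelihood gradient; failed samples (reward 0) leave θ untouched, matching the positive-outcome setting.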
Key Hyperparameters:
learning_rate: η (theoretical constant)
gamma: Task logit scaling factor (assumed large)
Compute: Not reported in the paper
Comparison to Prior Work
vs. Supervised CoT: Analyzes the harder setting where NO intermediate supervision is available
vs. Standard RLVR empirical papers: Provides theoretical bounds on WHEN RLVR works, rather than just showing empirical success
vs. STAR (Self-Taught Reasoner) [not cited in paper]: STAR iteratively curates rationales; this work analyzes the fundamental learnability conditions of such iterative loops
Limitations
Analysis relies on the assumption of non-overlapping tasks (each task produces unique outputs)
Assumes a simplified linear model of task selection rather than full non-linear Transformer dynamics
Results are asymptotic/theoretical bounds rather than empirical benchmarks on modern LLMs
Reproducibility
Theoretical paper. Detailed proofs are provided for Theorems 5.2 and 5.4. No code or datasets are required for replication of the theoretical results.
📊 Experiments & Results
Evaluation Setup
Theoretical analysis of convergence properties for autoregressive compositional problems under RLVR
Metrics:
Sample complexity (number of iterations to convergence)
Convergence probability
Statistical methodology: Mathematical proof
Main Takeaways
Outcome-based feedback is sufficient for learning compositional reasoning ONLY if the problem has 'inductive structure' (partially correct chains imply higher success probability).
If the base model is too weak (random guessing on intermediate steps), the task-advantage ratio vanishes, and RLVR fails to learn even simple compositions.
Learning time scales quadratically, O(S²), with the length S of the chain of thought under favorable conditions, but can be exponential otherwise.
The theoretical findings explain empirical observations where RLVR 'sharpens' existing capabilities but fails to teach completely new compositional skills from scratch.
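The role of the task advantage in these takeaways can be seen in a two-armed toy (my own illustration, not from the paper): REINFORCE on positive outcomes only learns to prefer the 'good' task when it actually raises the verifier's success probability.

```python
import math
import random

def train(success_prob, n_iters=5000, lr=0.2, seed=0):
    """Two-task, single-step REINFORCE toy.

    success_prob maps the chosen task ('good'/'bad') to P(verifier passes).
    Returns the final probability of selecting 'good'. When both tasks
    succeed equally often (task-advantage ratio = 1), the expected
    gradient is zero and the policy has nothing to learn from.
    """
    rng = random.Random(seed)
    theta = {"good": 0.0, "bad": 0.0}
    for _ in range(n_iters):
        z = sum(math.exp(v) for v in theta.values())
        probs = {k: math.exp(v) / z for k, v in theta.items()}
        t = "good" if rng.random() < probs["good"] else "bad"
        r = 1 if rng.random() < success_prob[t] else 0
        for k in theta:  # reward-weighted score-function update
            theta[k] += lr * r * ((k == t) - probs[k])
    z = sum(math.exp(v) for v in theta.values())
    return math.exp(theta["good"]) / z
```

With `success_prob = {"good": 0.9, "bad": 0.1}` the policy converges to the good task; with equal success probabilities the logits merely random-walk, mirroring the vanishing-advantage failure mode.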
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (REINFORCE algorithm)
Autoregressive Language Models
Chain of Thought (CoT) reasoning
Key Terms
RLVR: Reinforcement Learning with Verifiable Rewards—training models using binary feedback (correct/incorrect) on the final output rather than step-by-step supervision
CoT: Chain of Thought—a reasoning technique where models generate intermediate steps before producing a final answer
Task-Advantage Ratio: The ratio between the probability of eventual verification success when a specific intermediate task is selected versus when it is not selected
Autoregressive composition: A function computed by applying a sequence of simpler tasks (functions) one after another, where each step depends on the previous output
REINFORCE: A policy gradient algorithm that updates model parameters to increase the probability of actions that yield high rewards
Kleene closure: The set of all finite strings formed by concatenating symbols from a vocabulary
Inductive structure: A property of a problem where partially correct chains of thought yield a higher probability of final success than incorrect ones