
On the Learning Dynamics of RLVR at the Edge of Competence

Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen
Wharton School, University of Pennsylvania, Carnegie Mellon University, Yale University, The Ohio State University
arXiv (2026)
RL Reasoning

📝 Paper Summary

Tags: Reinforcement Learning with Verifiable Rewards (RLVR) · Compositional Reasoning · Theoretical Analysis of Deep Learning
RLVR solves long-horizon reasoning tasks through a "relay effect": smooth difficulty curricula bridge the gradient barrier, while large difficulty gaps cause stalling and grokking-like phase transitions.
Core Problem
RLVR relies on sparse, outcome-based rewards (correct/incorrect), which provide little signal for long-horizon reasoning tasks where the search space of trajectories is exponentially large.
Why it matters:
  • It remains a mystery how outcome-only feedback can drive learning in complex reasoning chains (like math or coding) without dense intermediate supervision
  • Understanding these dynamics is crucial for scaling reasoning models (like OpenAI-o3 or DeepSeek-R1) efficiently rather than relying on trial-and-error data mixing
  • Current empirical observations of 'grokking' (sudden learning after long plateaus) lack a rigorous theoretical mechanism explaining when and why they occur
Concrete Example: In a multi-step state tracking task (e.g., applying 45 sequential operations), a model initialized with random attention has a near-zero chance of guessing the final state correctly. Without intermediate feedback, the gradient is exponentially flat, and the model learns nothing for a long time.
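The flat-gradient intuition above can be illustrated with a toy Monte Carlo sketch (my own illustration, not the paper's actual setup): if a model tracks each of the L sequential operations correctly only with some per-step probability, the chance of an outcome reward on the full trajectory decays exponentially in L, so the learning signal vanishes for long horizons.

```python
import random

def success_prob(horizon, per_step_acc, trials=100_000, seed=0):
    """Estimate the probability that a trajectory earns the outcome reward.

    Toy model: the trajectory succeeds only if every one of `horizon`
    sequential state-tracking steps is correct, each with probability
    `per_step_acc` (a stand-in for a weakly trained model).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if all(rng.random() < per_step_acc for _ in range(horizon)):
            wins += 1
    return wins / trials

# Reward signal collapses as the horizon grows (roughly per_step_acc ** L),
# mirroring why Length=45 gives near-zero reward without a curriculum.
for length in (5, 20, 45):
    print(f"L={length:2d}  P(reward) ~ {success_prob(length, 0.9):.4f}")
```

With 90% per-step accuracy, the estimate tracks 0.9**L: healthy at L=5 but under 1% at L=45, matching the "near-zero chance" described above.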
Key Novelty
Theoretical framework for 'Relay Dynamics' vs 'Grokking' in RLVR
  • Identifies that the smoothness of the problem difficulty spectrum determines learning phases: smooth spectra allow easier problems to 'relay' gradient signals to slightly harder ones
  • Demonstrates that 'grokking' (long plateaus followed by jumps) arises specifically from discontinuities in difficulty, where the model must over-master an easy task before the next hard task provides any signal
  • Introduces a Fourier analysis framework on finite groups to mathematically estimate policy gradients for long-horizon compositional tasks, overcoming the intractability of trajectory-level probability calculations
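For readers unfamiliar with Fourier analysis on finite groups, here is a minimal sketch of the general idea on the simplest case, the cyclic group Z_n (the choice of Z_n, and all names below, are my illustration; the paper's framework and groups may differ): expanding a function over the group's characters turns convolution-style quantities (such as trajectory probabilities built from composed operations) into products of coefficients, which is what makes long-horizon gradient estimates tractable.

```python
import cmath

def character(n, j, x):
    # j-th character of the cyclic group Z_n: chi_j(x) = exp(2*pi*i*j*x / n)
    return cmath.exp(2j * cmath.pi * j * x / n)

def fourier_coeffs(f_vals):
    """Fourier coefficients of f : Z_n -> C, given as a list of n values.

    hat{f}(j) = (1/n) * sum_x f(x) * conj(chi_j(x))
    """
    n = len(f_vals)
    return [
        sum(f_vals[x] * character(n, j, x).conjugate() for x in range(n)) / n
        for j in range(n)
    ]

# Sanity check: a pure character has a spectrum concentrated on one frequency.
n = 8
f = [character(n, 1, x) for x in range(n)]
coeffs = fourier_coeffs(f)
print([round(abs(c), 6) for c in coeffs])
```

On non-abelian groups the characters are replaced by irreducible matrix representations, but the same diagonalization principle applies.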
Evaluation Highlights
  • Synthetic experiments show mixed-difficulty training with a moderate ratio (R=3) enables solving long-horizon tasks (Length=45), whereas fixed-length training at Length=45 fails completely (near-zero reward)
  • Large difficulty ratios (R=7) cause 'grokking': the model stalls at near-zero reward on longer tasks for extended periods before sudden mastery, confirming theoretical predictions of phase transitions
  • Short-horizon training (Length=5) succeeds rapidly with optimal rewards, while horizons beyond a critical threshold (approx. Length > 20) exhibit prolonged reward plateaus in the absence of a curriculum
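The mixed-difficulty setup in these experiments can be sketched as a simple curriculum sampler (a hypothetical reconstruction: the uniform mixing, the base length of 5, and the function names are my assumptions; the summary only specifies the difficulty ratio R and the target Length=45):

```python
import random

def curriculum_lengths(base_len=5, ratio=3, levels=3):
    # Geometric ladder of task lengths: base_len * ratio**k for each level.
    # With base_len=5, ratio=3, levels=3 this yields [5, 15, 45], so the
    # hardest level matches the Length=45 target in the experiments.
    return [base_len * ratio**k for k in range(levels)]

def sample_task_length(rng=random, **kwargs):
    # Mixed-difficulty training: draw each episode's length uniformly
    # from the ladder, so easier tasks can "relay" signal to harder ones.
    return rng.choice(curriculum_lengths(**kwargs))

print(curriculum_lengths())            # [5, 15, 45]
print(sample_task_length(ratio=7))     # large R=7 gap -> grokking regime
```

Under this sketch, R=3 keeps adjacent levels close enough for the relay effect, while R=7 (lengths 5, 35, 245) leaves a gap the gradient cannot bridge until the easier level is over-mastered.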
Breakthrough Assessment
8/10
Provides the first rigorous theoretical explanation for why RLVR works 'at the edge of competence' and mechanistically explains the grokking phenomenon in reasoning tasks. Highly relevant to current LLM reasoning developments.