GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance
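A minimal sketch of the group-relative normalization described above, assuming binary rewards and a small epsilon to avoid division by zero (the function name and shapes are illustrative, not GRPO's actual implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within a group of outputs sampled from the same input.

    Each output's advantage is its reward minus the group mean, divided by the
    group standard deviation; this centers the signal and reduces variance.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled outputs for one prompt: two correct (reward 1), two incorrect (0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Correct outputs get a positive advantage and incorrect ones a negative advantage of equal magnitude, so the policy gradient pushes probability mass from failures toward successes within the group.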
Pass@1: The fraction of problems the model solves with a single generated sample; equivalently, its expected success rate when given one attempt per problem
Pass@K: The probability that at least one of K generated samples is correct
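Pass@K is usually computed with the standard unbiased estimator: draw n samples per problem, count the c correct ones, and estimate the chance that a random size-K subset contains at least one success. A short sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator given n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a uniformly
    random subset of k of the n samples contains at least one correct answer.
    """
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: Pass@1 equals the raw success rate c/n,
# while Pass@5 is substantially higher.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the formula reduces to c/n, matching the Pass@1 definition above; as k grows toward n it approaches 1 whenever at least one sample is correct, which is why very large K probes the coverage wall.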
Temperature Distillation: The phenomenon where RL makes a model's performance robust to high sampling temperatures, flattening the precision curve
Coverage Wall: The limit of unique problems a model can solve even with infinite sampling (Pass@K as K approaches infinity), which RL fails to expand
Self-difficulty sorting: Ranking test problems based on the model's own precision (success rate) on them, rather than external difficulty labels
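A possible implementation of this ranking, assuming per-problem success counts over a fixed sample budget (the counts and problem names below are hypothetical):

```python
def sort_by_self_difficulty(success_counts, num_samples):
    """Rank problems from easiest to hardest by the model's own precision.

    precision = fraction of the model's sampled answers that were correct;
    no external difficulty labels are consulted.
    """
    precision = {p: c / num_samples for p, c in success_counts.items()}
    return sorted(precision, key=precision.get, reverse=True)

# Hypothetical counts: correct answers out of 16 samples per problem.
counts = {"p1": 12, "p2": 2, "p3": 16}
print(sort_by_self_difficulty(counts, 16))  # ['p3', 'p1', 'p2']
```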
Plan Grade: The fraction of generated solutions that contain the correct sequence of high-level steps (approach) required to solve the problem
Execution Grade: The fraction of solutions with a correct plan that are also carried through correctly, arriving at the right final answer
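The two grades above can be computed from per-solution annotations; because the Execution Grade is conditioned on the plan being correct, their product recovers the overall success rate. A sketch under the assumption that each solution carries boolean `plan_ok` and `answer_ok` labels (both names are illustrative):

```python
def plan_and_execution_grades(solutions):
    """Compute Plan Grade and Execution Grade from annotated solutions.

    Plan Grade: fraction of all solutions whose high-level approach is correct.
    Execution Grade: among plan-correct solutions, fraction whose final answer
    is also correct. Plan Grade * Execution Grade = overall success rate.
    """
    n = len(solutions)
    plan_ok = [s for s in solutions if s["plan_ok"]]
    exec_ok = [s for s in plan_ok if s["answer_ok"]]
    plan_grade = len(plan_ok) / n
    execution_grade = len(exec_ok) / len(plan_ok) if plan_ok else 0.0
    return plan_grade, execution_grade

# Hypothetical annotations for four sampled solutions to one problem:
# three have the right approach, two of those reach the right answer.
sols = [
    {"plan_ok": True,  "answer_ok": True},
    {"plan_ok": True,  "answer_ok": False},
    {"plan_ok": True,  "answer_ok": True},
    {"plan_ok": False, "answer_ok": False},
]
print(plan_and_execution_grades(sols))
```

Here the Plan Grade is 3/4 and the Execution Grade is 2/3, so the overall success rate is 3/4 × 2/3 = 1/2, matching the two correct answers out of four samples.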
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer