SFT: Supervised Fine-Tuning—training a model to mimic specific target outputs (like reasoning traces) given inputs
RL: Reinforcement Learning—training a model to maximize a reward signal (e.g., correct answer) rather than just mimic text
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training
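The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation; real GRPO pipelines compute these advantages per token and combine them with a clipped policy-gradient objective:

```python
import statistics

def group_relative_advantages(rewards):
    """Turn raw rewards for a group of outputs sampled from the same prompt
    into advantages by subtracting the group mean and dividing by the group
    standard deviation (population std here; some implementations differ)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1 if correct, 0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group's own mean, correct answers get positive advantages and incorrect ones negative, without needing a separate learned value model.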
Transferability Index: A metric proposed in this paper to quantify how well improvements in one domain (math) translate to gains in others (coding, general QA)
PCA shift: A measure of how much the principal directions of a model's hidden-state representations rotate in feature space after training
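One simple way to quantify such a shift (a hypothetical sketch, not necessarily the paper's exact metric) is to compare the top principal direction of the hidden states before and after training:

```python
import numpy as np

def top_principal_direction(hidden_states):
    """First principal component of a (samples x features) activation matrix."""
    centered = hidden_states - hidden_states.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def pca_direction_shift(before, after):
    """1 - |cos(angle)| between top principal directions.
    0.0 means the dominant direction is unchanged; values near 1.0
    mean it has rotated to be nearly orthogonal."""
    v1 = top_principal_direction(before)
    v2 = top_principal_direction(after)
    return 1.0 - abs(float(np.dot(v1, v2)))
```

Feeding in hidden states collected on the same prompts before and after training gives a single scalar per layer that can be compared across SFT and RL runs.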
KL divergence: Kullback-Leibler divergence—an asymmetric statistical measure of how much one probability distribution (e.g., a model's token predictions) differs from a reference distribution
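For discrete distributions, KL(P || Q) = Σᵢ pᵢ log(pᵢ / qᵢ). A minimal sketch over a toy vocabulary:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.
    Asymmetric: KL(P || Q) generally differs from KL(Q || P).
    Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two next-token distributions over a 3-word vocabulary
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
divergence = kl_divergence(p, q)  # > 0; zero only when p == q
```

In RL fine-tuning, a KL term against the pre-trained reference model is often added to the reward to keep the policy's token predictions from drifting too far.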
CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer
On-policy: Training where the model learns from data generated by its current version (common in RL)
Off-policy: Training where the model learns from static data generated by a previous or different model (common in SFT)
OlympiadBench: A challenging benchmark consisting of Olympiad-level mathematics and physics problems