
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
The University of Tokyo
arXiv (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

Post-training for Reasoning LLMs · Reinforcement Learning with Verifiable Rewards (RLVR) · Mechanistic Analysis of Reasoning
By modeling reasoning as both trajectories and graphs, this study reveals that RL consolidates reasoning into fewer, high-frequency correct paths (squeezing), while SFT diversifies valid strategies (expanding), justifying the two-stage SFT+RL training paradigm.
Core Problem
While RL and SFT are the standard post-training methods for reasoning LLMs, their specific effects on the underlying reasoning process remain poorly understood, since evaluations typically rely only on final-answer accuracy (Pass@k).
Why it matters:
  • Current training recipes (SFT followed by RL) are developed through trial-and-error without understanding why they work effectively together.
  • Pass@k metrics can mask underlying behaviors, such as whether a model is memorizing specific paths or generalizing to new strategies.
  • Understanding how training alters reasoning topology is crucial for designing better data curation strategies and more efficient post-training methods.
Concrete Example: When an RL-trained model is sampled multiple times, it often collapses onto a few specific solution paths (high repetition), whereas an SFT-trained model may generate many diverse valid paths. Without analyzing this 'squeezing' vs. 'expanding' behavior, researchers cannot explain why RL improves Pass@1 but degrades Pass@k at large k.
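The Pass@k metric discussed above is typically computed with the standard unbiased estimator: given n sampled generations of which c are correct, it estimates the probability that at least one of k draws is correct. A minimal sketch (function name is our own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, is a correct one."""
    if n - c < k:
        # Fewer incorrect generations than draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model with 40 correct generations out of 100 samples:
print(pass_at_k(100, 40, 1))   # 0.4
print(pass_at_k(100, 40, 10))  # much higher, if the 40 are diverse paths
```

This makes the masking effect concrete: Pass@k reports only whether *any* sample succeeds, and says nothing about how many distinct solution paths the successes represent.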
Key Novelty
Trajectory and Step-Level Reasoning Analysis Framework
  • Trajectory-level: Clusters entire generated reasoning chains to quantify 'unique' correct vs. incorrect paths, revealing how training objectives alter the diversity of solutions.
  • Step-level: Constructs a 'reasoning graph' where nodes are clustered sentence embeddings and edges are transitions, analyzing topological properties like centrality and modularity to see how reasoning flows change.
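The step-level graph described above can be sketched as follows. We assume the clustering stage has already mapped each reasoning sentence to a cluster ID via its embedding (that stage is omitted here); what remains is counting node visits and step-to-step transitions:

```python
from collections import Counter

def build_reasoning_graph(trajectories):
    """Build a directed reasoning graph from sampled trajectories, where
    each trajectory is a list of cluster IDs (one per reasoning step).
    Returns (node_visits, edge_counts): how often each clustered step is
    visited, and how often each step-to-step transition occurs."""
    node_visits = Counter()
    edge_counts = Counter()
    for steps in trajectories:
        node_visits.update(steps)                 # count step occurrences
        edge_counts.update(zip(steps, steps[1:])) # count transitions
    return node_visits, edge_counts

# Toy example: three sampled chains over (hypothetical) clustered steps.
trajs = [["parse", "setup", "solve"],
         ["parse", "setup", "check", "solve"],
         ["parse", "solve"]]
nodes, edges = build_reasoning_graph(trajs)
# nodes identifies hub steps; edges give the transition structure on which
# properties like centrality and modularity can then be computed.
```

The resulting edge counts can be handed to a graph library to compute the topological properties the paper analyzes; this sketch only covers graph construction.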
Evaluation Highlights
  • RL steepens the decay rate of node visitation frequency in reasoning graphs by ~2.5x, indicating concentration of reasoning into fewer hub steps.
  • SFT flattens the decay rate of node visitation frequency to ~1/3, indicating expansion of reasoning across diverse steps.
  • SFT increases the count of unique correct trajectories (expansion), whereas RL drastically reduces the count of unique incorrect trajectories (squeezing/compression).
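The decay rates quoted above describe how node visitation frequency falls off with rank. One standard way to estimate such an exponent (a sketch, not necessarily the paper's exact procedure) is a log-log linear fit on the rank-frequency curve:

```python
import numpy as np

def decay_exponent(visit_counts):
    """Estimate a power-law-style decay exponent: sort visitation counts
    in descending order, then regress log(frequency) on log(rank).
    A steeper (more negative) slope means visits concentrate in a few hub
    nodes (RL-style squeezing); a flatter slope means visits spread across
    many steps (SFT-style expansion)."""
    freqs = np.sort(np.asarray(visit_counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic check: counts proportional to rank^-2 should recover slope -2.
counts = [1000.0 / r**2 for r in range(1, 51)]
print(decay_exponent(counts))  # approximately -2.0
```

Comparing this exponent before and after RL or SFT gives a single number for the squeezing/expanding effect, complementing the trajectory-count analysis.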
Breakthrough Assessment
8/10
Provides the first comprehensive mechanistic explanation for the success of the standard SFT+RL pipeline. The graph-theoretic perspective offers a novel, quantitative way to measure 'reasoning diversity' beyond simple accuracy metrics.