Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

📝 Paper Summary

LLM Reasoning, Reinforcement Learning (RL) Scaling, Inference Scaling

T1 scales reinforcement learning for reasoning by initializing with trial-and-error CoT data, oversampling with high temperature during training, and demonstrating that longer single-generation thinking directly improves performance.

Core Problem

Current reasoning approaches rely on imitation learning or repeated sampling with verifiers, which fail to fundamentally improve the policy model's ability to self-explore or scale performance with longer thinking.

Why it matters:

Repeated sampling relies on external verifiers and does not improve the core model's reasoning capabilities.
Existing RL attempts yield modest improvements in complex reasoning compared to imitation learning stages.
True inference scaling requires models to 'think longer' effectively rather than just selecting from many short attempts.

Concrete Example: A standard SFT model might output a correct answer through a memorized shortcut. However, when faced with a complex math problem, it fails to self-correct if the first step is wrong. T1 explicitly learns to say 'Wait, perhaps...' or 'Let's try a different approach', recovering from errors within a single long generation.

Key Novelty

Scaled RL with Exploration and Long-Thinking Inference

Initializes the policy with synthetic CoT data that explicitly includes reflection, trial-and-error, and self-verification, rather than just perfect reasoning paths.
Scales RL training by oversampling (K=64) with high temperature to force exploration, stabilized by penalties for repetition and garbage text.
Analyzes inference scaling by truncating single long reasoning chains, showing that increased token budgets directly correlate with accuracy without external verifiers.

Architecture

The overall T1 training pipeline: SFT initialization with CoT data (Attempt/Reflect/Answer), followed by scaled RL training.

Evaluation Highlights

Outperforms QwQ-32B-Preview on MATH500 (92.4% vs 90.6%) and AIME 2024 (50.6% vs 50.0%) using Qwen2.5-32B base.
Achieves a +25.7 percentage-point accuracy gain on AIME (24.9% → 50.6%) via RL scaling compared to its own SFT baseline.
Demonstrates inference scaling: Extending reasoning length consistently improves AIME accuracy from ~24% to 50% as thinking tokens increase from 2k to 6k.

Breakthrough Assessment

9/10

Significantly advances open-source reasoning by replicating o1-like inference scaling behaviors using standard RL techniques (not proprietary algorithms) and open weights.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from Human Feedback (RLHF) applied to mathematical reasoning tasks.

Inputs: Math problem prompt x

Outputs: Reasoning chain y (potentially long, with self-corrections) and final answer

Pipeline Flow

Synthesize CoT Data (Trial-and-Error) -> SFT Initialization
RL Training (Oversampling + Penalties) -> T1 Policy
Inference (Single Long Generation) -> Answer

System Modules

SFT Initializer (Training)

Initialize policy with rich reasoning patterns including reflection and verification

Model or implementation: Qwen2.5 / GLM-4 base models

RL Trainer (Training)

Scale reasoning via exploration and optimization

Model or implementation: Policy Model (updated via RLOO)

Novel Architectural Elements

Inference Scaling measurement strategy: Truncating single long responses to different lengths to simulate varying inference budgets, rather than repeated sampling.
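
The truncation-based measurement can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: token lists stand in for real tokenizer output, and `finish` is a hypothetical callable that turns a truncated reasoning prefix into a final answer (in practice this would be the model itself). The toy chains and `toy_finish` exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Optional, Sequence


def truncate_tokens(tokens: Sequence[str], budget: int) -> List[str]:
    """Keep at most `budget` leading tokens of one reasoning chain."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return list(tokens[:budget])


def accuracy_at_budget(
    chains: Sequence[Sequence[str]],
    gold: Sequence[str],
    budget: int,
    finish: Callable[[List[str]], Optional[str]],
) -> float:
    """Accuracy when every chain is cut to the same token budget."""
    correct = 0
    for tokens, answer in zip(chains, gold):
        prefix = truncate_tokens(tokens, budget)
        if finish(prefix) == answer:
            correct += 1
    return correct / len(chains)


def toy_finish(prefix: List[str]) -> Optional[str]:
    """Hypothetical stand-in: answer with the last number seen so far."""
    numbers = [t for t in prefix if t.isdigit()]
    return numbers[-1] if numbers else None


# Two toy chains; the first only reaches the right answer after a correction.
chains = [
    "try 12 wait check again 15".split(),
    "guess 7 verify 7".split(),
]
gold = ["15", "7"]

# One accuracy value per simulated inference budget.
curve: Dict[int, float] = {
    b: accuracy_at_budget(chains, gold, b, toy_finish) for b in (2, 4, 6)
}
print(curve)
```

A single long generation is scored at several budgets, so no repeated sampling or external verifier is involved in building the curve.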

Modeling

Base Models: Qwen2.5-32B, Qwen2.5-14B, GLM-4-9B

Training Method: Reinforcement Learning (variant of REINFORCE/RLOO)

Objective Functions:

Purpose: Maximize reward while staying close to the reference policy.

Formally: J_r(π_θ) = E[r(x,y) - β log(π_θ(y|x)/π_ref(y|x))]

Purpose: Normalize rewards using a Leave-One-Out baseline.

Formally: r_bar_i = r_i - (1/(K-1)) * sum_{j!=i} r_j

Purpose: Encourage token diversity via entropy.

Formally: L = L_RL + α * H(π(·|x))
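
The first two formulas can be made concrete with a short sketch. It is an illustration of the written expressions only, not the authors' training code: the function names are hypothetical, and the log-probabilities are assumed to be sequence-level sums supplied by the caller.

```python
from typing import List


def kl_shaped_reward(
    reward: float, logp_policy: float, logp_ref: float, beta: float
) -> float:
    """r(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) for one sample."""
    return reward - beta * (logp_policy - logp_ref)


def rloo_advantages(rewards: List[float]) -> List[float]:
    """r_bar_i = r_i - mean of the other K-1 rewards for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# K = 4 sampled responses for one prompt (the paper's setting uses K = 64).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
print(advantages)

# One response whose log-probability rose by 2 nats relative to the reference.
print(kl_shaped_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1))
```

Because each baseline excludes the sample it is applied to, the leave-one-out advantages for a prompt always sum to zero, which is why responses are only rewarded relative to their siblings.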

Training Data:

Synthesized CoT data: Multiple attempts are sampled from LLMs, a critic identifies errors and verifies correctness, and the attempts are then rewritten into a single 'Attempt-Reflect-Answer' path.

Key Hyperparameters:

sampling_n_responses_K: 64
sampling_temperature: high (tested range 0.9-1.3; values above 1.0 are reported as crucial)
penalty_reward: -1 (for repetition, overlong text, garbage)
top_p: 0.95
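
How temperature and top_p reshape the sampling distribution can be sketched in a few lines. This is a generic illustration of temperature scaling followed by nucleus (top-p) filtering, not the sampler used in the paper; the logits are made up and the function names are hypothetical.

```python
import math
import random
from typing import List


def temperature_top_p_probs(
    logits: List[float], temperature: float, top_p: float
) -> List[float]:
    """Softmax with temperature, then keep the smallest top-p nucleus."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Add tokens from most to least likely until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the kept tokens; everything else gets zero.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(
    logits: List[float], temperature: float, top_p: float, rng: random.Random
) -> int:
    """Draw one token index from the filtered distribution."""
    probs = temperature_top_p_probs(logits, temperature, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -5.0]  # made-up scores for a 4-token vocabulary
cold = temperature_top_p_probs(logits, temperature=0.9, top_p=0.95)
hot = temperature_top_p_probs(logits, temperature=1.3, top_p=0.95)
print(cold)
print(hot)
print(sample_token(logits, 1.3, 0.95, random.Random(0)))
```

Raising the temperature moves probability mass away from the most likely token, which is the mechanism by which high-temperature sampling produces more varied responses among the K samples.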

Comparison to Prior Work

vs. Repeated Sampling: T1 improves the policy itself to think longer in a single chain, rather than selecting from multiple short samples.
vs. QwQ-32B-Preview: T1 achieves higher accuracy on AIME and MATH500 using the same base model size.
vs. Standard RLHF: T1 uses oversampling (K=64), high temperature, and specific CoT structures (trial-and-error) to force exploration, whereas standard RL often collapses or improves marginally.

Limitations

RL training is sensitive to sampling parameters; low temperatures cause collapse, high temperatures risk instability.
Requires synthesized CoT data with specific trial-and-error structures for SFT initialization.
Computational cost of oversampling (K=64) during training is high compared to standard K=8 setups.

Reproducibility

Code: https://github.com/THUDM/T1

Model weights and SFT/RL training data are publicly available at https://github.com/THUDM/T1. The text does not detail specific training scripts beyond the method description, so code availability is implied rather than confirmed.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning benchmarks evaluated using Accuracy (Pass@1).

Benchmarks:

MATH500 (Competition-level math problems)
AIME 2024 (High-difficulty math competition)
Omni-MATH-500 (Olympiad-level mathematics)
GPQA (Graduate-level science QA; OOD generalization)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper

Key Results

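T1 (Qwen2.5-32B) outperforms strong baselines including QwQ and proprietary models on math benchmarks. Baselines differ by row: per the Evaluation Highlights, the MATH500 baseline (90.6) is QwQ-32B-Preview, while the AIME 2024 baseline (24.9) is T1's own SFT model.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
MATH500	Accuracy	90.6	92.4	+1.8
AIME 2024	Accuracy	24.9	50.6	+25.7
Omni-MATH-500	Accuracy	46.6	49.6	+3.0
GPQA	Accuracy	49.5	56.1	+6.6
Ablation studies show the impact of sampling diversity (K) during training.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
MATH500	Accuracy	83.0	86.0	+3.0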

Experiment Figures

Training and inference scaling curves on AIME2024. X-axis is max generation length (inference budget), Y-axis is accuracy. Different lines represent different RL training steps.

Inference scaling performance on AIME, Omni-MATH, and MATH500 using truncated thinking.

Main Takeaways

Inference Scaling is real: Longer reasoning chains (thinking tokens) directly correlate with higher accuracy on hard tasks (AIME), even when artificially truncated.
RL Scaling requires exploration: Simply training with low K or low temperature yields minimal gains. High K (64) and high temperature (>1.0) are crucial.
Trial-and-error initialization matters: SFT data must contain mistakes and corrections to seed the exploration capability for RL.
Penalties prevent collapse: Without penalties for repetition/garbage, high-temperature RL training tends to produce long, nonsensical outputs.
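
The penalty idea can be sketched as a reward wrapper. Only the penalty value of -1 comes from the hyperparameters listed earlier; the n-gram repetition detector, the thresholds, the length limit, and the 1/0 correctness reward are all assumptions made for illustration, and garbage-text detection is left out.

```python
from typing import List

PENALTY = -1.0  # penalty_reward from the Key Hyperparameters list


def repeated_ngram_fraction(tokens: List[str], n: int = 4) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram (0 = no repeats)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def shaped_reward(
    tokens: List[str],
    is_correct: bool,
    max_tokens: int = 8192,   # assumed length limit
    max_repeat: float = 0.5,  # assumed repetition threshold
) -> float:
    """Return the penalty for degenerate outputs, else a correctness reward."""
    if len(tokens) > max_tokens:
        return PENALTY
    if repeated_ngram_fraction(tokens) > max_repeat:
        return PENALTY
    return 1.0 if is_correct else 0.0


looping = ["wait"] * 40
clean = "let x be 3 then x squared is 9".split()
print(shaped_reward(looping, is_correct=True))   # penalised despite "correct"
print(shaped_reward(clean, is_correct=True))
```

Checking for degeneracy before correctness means a looping response cannot earn reward by eventually stumbling on the right answer.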

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/RLOO)
Chain-of-Thought (CoT) prompting
KL Divergence
Language Model alignment

Key Terms

Inference Scaling: The property where increasing the compute budget at test time (e.g., generating more tokens) leads to better performance.

CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer.

RLOO: Reinforce Leave-One-Out—a gradient estimator for RL that uses the average reward of other samples as a baseline to reduce variance.

SFT: Supervised Fine-Tuning—training the model on labeled examples (here, synthetic reasoning paths) before RL.

KL divergence: A measure of how much the RL policy deviates from the reference model; used as a penalty to prevent the model from outputting gibberish to hack rewards.

Trial-and-error CoT: Training data that includes incorrect attempts and subsequent corrections, teaching the model to recover from mistakes.

Entropy bonus: An auxiliary loss term added to RL to encourage the model to output diverse tokens and prevent collapse into repetitive patterns.
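
The quantity behind the entropy bonus can be computed directly. The sketch below evaluates the Shannon entropy H of a single next-token distribution; the example distributions are made up, and how the term is weighted and signed inside the training loss is not shown here.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally diverse over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic
print(token_entropy(uniform))
print(token_entropy(peaked))
```

A policy that has collapsed into repetitive patterns looks like the peaked case at most positions, so rewarding higher entropy pushes against that collapse.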