Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

📝 Paper Summary

Reinforcement Learning with Verifiable Reward (RLVR) Long Chain-of-Thought (CoT) Reasoning Reasoning Model Distillation

TFPI introduces a training stage where distilled reasoning models are optimized on input queries stripped of thinking tokens, reducing rollout costs while improving subsequent slow-thinking RL performance.

Core Problem

Training large reasoning models (LRMs) via RLVR requires processing extremely long Chain-of-Thought contexts during rollouts, leading to massive computational costs and potential performance degradation if contexts are shortened too aggressively.

Why it matters:

Generating long Chains-of-Thought during RL training incurs substantial compute expenses (e.g., 8K H800 hours for a 4B model)
Starting RL training with overly short contexts to save compute often causes irreversible drops in reasoning accuracy
Current multistage training strategies are still computationally heavy and may not fully mitigate performance loss

Concrete Example: When training Qwen-3-4B using standard DAPO (a direct RL method) with a restricted 4K response length, accuracy on AIME25 drops significantly (reduces avg@32 by >40%). In contrast, applying the ThinkingFree operation allows the model to improve accuracy by ~2% under the same constraints.

Key Novelty

Thinking-Free Policy Initialization (TFPI)

Initializes the RL policy by training on 'Thinking-Free' queries—inputs where the thinking content is explicitly discarded via a special token format (Thinking-Free operation)
Forces the model to learn efficiently from short-context rollouts before transitioning to full long-CoT reasoning, acting as a bridge between distillation and standard RLVR
Demonstrates that training in this thinking-free mode enhances the model's capability in the original slow-thinking mode while drastically reducing token consumption

Architecture

Conceptual comparison of TFPI versus standard Long-CoT RL and Multistage RL in terms of compute and performance.

Evaluation Highlights

Achieves 89.0% accuracy on AIME24 with a 4B model using TFPI only (no subsequent RL), consuming less than 4K H20 hours
TFPI+RL boosts AIME25 accuracy for Qwen3-4B from 70.6% to 76.0% (+5.4%) compared to TFPI alone, surpassing Direct RL baselines under matched compute
Reduces training compute significantly: TFPI+RL requires ~1.5K H800 GPU hours for a 4B model to outperform Polaris-4B (which uses ~8K hours)

Breakthrough Assessment

8/10

Offers a highly practical solution to the massive compute costs of training reasoning models. By validating that 'thinking-free' pre-training boosts 'slow-thinking' performance, it challenges the assumption that long-context training is strictly necessary at all stages.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning tasks

Inputs: Query x (optionally transformed to x' via ThinkingFree operation)

Outputs: Response y containing reasoning steps and final answer

Pipeline Flow

Query Transformation (Apply ThinkingFree)
Policy Rollout (Generate responses)
Reward Computation (Outcome-based)
Policy Update (DAPO/GRPO)

System Modules

ThinkingFree Operator

Modifies input queries to suppress reasoning generation

Model or implementation: Deterministic string manipulation

Policy Model

Generates responses based on input queries

Model or implementation: DeepSeek-Distilled-Qwen-1.5B / Qwen3-4B / DeepSeek-Distilled-Qwen-7B

Novel Architectural Elements

Two-stage RL pipeline: Stage 1 (TFPI) optimizes on thinking-free inputs (x') with verifiable rewards; Stage 2 (Standard RLVR) optimizes on original inputs (x) with long-CoT generation initialized from Stage 1

Modeling

Base Model: DeepSeek-Distilled-Qwen-1.5B, Qwen3-4B, DeepSeek-Distilled-Qwen-7B

Training Method: RLVR using DAPO (variant of GRPO)

Objective Functions:

Purpose: Optimize policy to maximize reward while staying close to reference policy.

Formally: TFPI objective J_TFPI(θ) = E[J_RLVR(θ, x')] where x' = ThinkingFree(x)

Training Data:

Polaris-53K dataset (math-specific data)

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 256
temperature: 1
+ 4 more
top_p: 1
top_k: -1
rollouts_per_problem: 8
warm_up: None

Compute: TFPI+RL for 4B model: ~1.5K H800 GPU hours. TFPI only for 4B model: <4K H20 hours.

Comparison to Prior Work

vs. Polaris: TFPI uses a thinking-free initialization stage with much shorter contexts (e.g., 4K-16K) before full RL, reducing compute by >80%
vs. DeepScaleR: TFPI achieves higher performance on DS-1.5B (30.8% vs 28.9% on AIME25) despite using shorter training contexts
vs. Inference-time methods (e.g., AdaptThink): TFPI optimizes the model itself to be efficient without needing specialized length-penalized rewards or dynamic inference routers

Limitations

TFPI training relies solely on math data (Polaris-53K), though it shows some generalization to code/other domains.
Requires an SFT-distilled LRM as a starting point; effectiveness on base models not explicitly analyzed.
Performance gains in out-of-domain tasks (like GPQA) can be inconsistent across model sizes.

Reproducibility

Code availability is not provided. The paper relies on the VeRL codebase. Key hyperparameters and training recipes (DAPO) are specified.

📊 Experiments & Results

Evaluation Setup

Evaluation of reasoning models on math, code, and general reasoning benchmarks using pass@1 accuracy.

Benchmarks:

AIME 2024 / 2025 (Mathematical Reasoning)
BeyondAIME (Mathematical Reasoning)
GPQA-Diamond (Multi-Task Reasoning (Science/Biology/Physics))
LiveCodeBench (Code Generation)
IFEval (Instruction Following)

Metrics:

pass@1 accuracy
Strict Prompt Accuracy (IFEval)
Average Output Length (Tokens)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of TFPI vs. Direct RL under matched training compute shows TFPI consistently yielding higher accuracy.
AIME 25	pass@1 accuracy	24.5	30.8	+6.3
AIME 25	pass@1 accuracy	60.2	63.8	+3.6
AIME 25	pass@1 accuracy	43.0	47.8	+4.8
Comparison of TFPI+RL strategy against Direct RL and other baselines, showing TFPI raises the performance ceiling.
AIME 25	pass@1 accuracy	62.0	76.0	+14.0
AIME 24	pass@1 accuracy	64.0	89.0	+25.0
Efficiency analysis comparing thinking-free inference to other token-reduction methods.
AIME 24	pass@1 accuracy	35.8	37.5	+1.7
AIME 24	Average Tokens	16694	5313	-11381

Experiment Figures

Impact of ThinkingFree on inference tokens and training dynamics.

Main Takeaways

Training with 'Thinking-Free' rollouts (TFPI) improves the model's 'Slow-Thinking' performance, even when using short context lengths that would normally degrade performance.
TFPI is a highly compute-efficient initialization strategy; a 4B model achieves 89.0% on AIME24 with <4K H20 hours, far less than standard RLVR.
Models trained with TFPI generalize well to out-of-domain tasks (Code, Instruction Following) even when trained only on math data.
TFPI creates a Pareto-efficient frontier for reasoning accuracy vs. token usage, outperforming specialized methods like ThinkLess and AdaptThink without complex reward shaping.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Chain-of-Thought (CoT) prompting
Knowledge Distillation in LLMs
Verifiable Rewards (e.g., math/code correctness)

Key Terms

RLVR: Reinforcement Learning with Verifiable Reward—training models using outcomes (like correct/incorrect answers) to guide learning, often encouraging long reasoning chains

ThinkingFree: An operation that transforms a query by appending a token sequence (like </think>) to explicitly discard the thinking/reasoning generation phase, forcing direct answer generation

TFPI: Thinking-Free Policy Initialization—a proposed training stage where the model is optimized using RL on ThinkingFree-transformed queries before standard long-CoT RL

DAPO: A specific RLVR algorithm (variant of GRPO) used in this paper that enables dynamic sampling and clipping

CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer

LRM: Large Reasoning Model—LLMs specifically trained (often via RL) to perform complex reasoning tasks

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative advantage of a group of outputs for the same input, removing the need for a critic model

SFT: Supervised Fine-Tuning—training a model on labeled examples

pass@1: The probability that a single generated solution is correct

rollout: The process of generating model responses during RL training to estimate rewards and gradients

AIME: American Invitational Mathematics Examination—a challenging math competition benchmark

GPQA: A challenging multi-task reasoning benchmark (Graduate-Level Google-Proof Q&A)

LiveCodeBench: A benchmark for evaluating code generation capabilities