
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

ByteDance Seed Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqiang Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, Riwei Chen, Liangqiang Chen, Zixin Chen, Jinsong Chen, Siyan Chen, Kaiyuan Chen, Zhi Chen, et al.
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Reasoning Models · Reinforcement Learning (RL) for LLMs
Seed1.5-Thinking is a Mixture-of-Experts reasoning model optimized via large-scale reinforcement learning with novel process verifiers, achieving state-of-the-art performance on math and coding benchmarks.
Core Problem
Training high-quality reasoning models is difficult due to the scarcity of high-quality Chain-of-Thought (CoT) data and the extreme instability of large-scale Reinforcement Learning (RL) training.
Why it matters:
  • Current reasoning models often rely on unstable RL training that crashes frequently, with run-to-run score differences of up to 10 points
  • Standard rule-based verifiers for math problems struggle with format variations (e.g., 2^19 vs 524288) and corner cases, leading to inaccurate reward signals
  • Existing benchmarks like AIME 2024 are becoming saturated and lack sufficient discrimination for top-tier models
Concrete Example: A standard verifier might reject a correct answer formatted as '2^{19}' if the reference is '524288', causing the model to learn incorrect behaviors. Seed1.5-Thinking uses a 'Thinking-Verifier' that reasons through the equivalence of these answers before judging.
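The failure mode above can be illustrated with a toy comparison between strict string matching and a normalization-based equivalence check. This is a minimal sketch, not the paper's verifier (which uses a reasoning model rather than rules); the function names and the power-notation normalization are assumptions for illustration only:

```python
import re

def exact_match(answer: str, reference: str) -> bool:
    """Rule-based check: strict string comparison (the failure mode described above)."""
    return answer.strip() == reference.strip()

def numeric_equivalent(answer: str, reference: str) -> bool:
    """Toy equivalence check: normalize simple power notation like '2^{19}'
    into a Python arithmetic expression and compare numeric values."""
    def evaluate(expr: str):
        expr = expr.strip().replace("^", "**").replace("{", "(").replace("}", ")")
        # Restrict to digits, whitespace, parentheses, and arithmetic operators.
        if not re.fullmatch(r"[\d\s()+\-*/.]+", expr):
            return None
        try:
            return eval(expr)  # acceptable here: input limited to arithmetic characters
        except Exception:
            return None
    a, b = evaluate(answer), evaluate(reference)
    return a is not None and a == b

# '2^{19}' vs '524288': exact match fails, numeric equivalence succeeds.
print(exact_match("2^{19}", "524288"))         # False
print(numeric_equivalent("2^{19}", "524288"))  # True
```

A reasoning-based verifier generalizes this idea far beyond arithmetic normalization, judging equivalence by working through the answer rather than pattern-matching it.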
Key Novelty
Seed1.5-Thinking with VAPO/DAPO RL and Seed-Thinking-Verifier
  • Integrates two novel RL frameworks (VAPO for actor-critic, DAPO for policy-gradient) to stabilize the notoriously unstable training of reasoning models
  • Employs a 'Seed-Thinking-Verifier' that generates its own reasoning path to judge student answers, reducing reward hacking and handling complex format variations better than rule-based checkers
  • Decouples the RL infrastructure into an asynchronous streaming rollout architecture with prioritized sample pools to improve iteration speed by 3x
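The prioritized sample pool in the asynchronous rollout architecture can be pictured as a priority queue that serves the most informative rollouts to the trainer first. The sketch below is an illustrative toy, assuming a max-priority ordering; the class name `SamplePool` and the priority heuristic are assumptions, not the paper's implementation:

```python
import heapq
import itertools

class SamplePool:
    """Toy prioritized sample pool: rollouts with higher priority
    (e.g. larger advantage or verifier disagreement) are trained on first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, rollout, priority: float):
        # heapq is a min-heap, so negate priority for max-first ordering.
        heapq.heappush(self._heap, (-priority, next(self._counter), rollout))

    def pop_batch(self, batch_size: int):
        """Drain up to batch_size rollouts in descending priority order."""
        batch = []
        while self._heap and len(batch) < batch_size:
            _, _, rollout = heapq.heappop(self._heap)
            batch.append(rollout)
        return batch

pool = SamplePool()
pool.add("rollout-a", priority=0.2)
pool.add("rollout-b", priority=0.9)
pool.add("rollout-c", priority=0.5)
print(pool.pop_batch(2))  # ['rollout-b', 'rollout-c']
```

In an asynchronous streaming setup, rollout workers would call `add` continuously while the trainer calls `pop_batch`, decoupling generation from optimization as the summary describes.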
Evaluation Highlights
  • Achieves 86.7% on AIME 2024, matching o3-mini-high and significantly outperforming DeepSeek R1 and o1
  • Surpasses DeepSeek R1 by 8.0% in user positive feedback on non-reasoning tasks, indicating strong generalization beyond just math/code
  • Attains 55.0% pass@1 on Codeforces (based on recent 12 contests), outperforming DeepSeek R1
Breakthrough Assessment
8/10
Strong performance matching or beating current SOTA (DeepSeek R1, o1) on key benchmarks with a smaller model (20B active params). Introduces significant infrastructure and verifier improvements.