Kimi k1.5: Scaling Reinforcement Learning with LLMs

📝 Paper Summary

Reinforcement Learning for LLM Reasoning

Kimi k1.5 scales reinforcement learning by utilizing long context windows up to 128k tokens, employing partial rollouts for efficiency, and simplifying policy optimization to achieve state-of-the-art reasoning without complex search trees or value functions.

Core Problem

Standard pretraining is limited by the availability of high-quality static data, and prior RL attempts with LLMs often failed to produce competitive results or relied on complex engineering such as Monte Carlo Tree Search (MCTS).

Why it matters:

Scaling compute via next-token prediction is hitting data bottlenecks, requiring a new axis for continued intelligence improvement.
Existing reasoning models often rely on expensive search algorithms (MCTS) or complex value functions that are hard to scale and deploy.
Effective long-chain reasoning is critical for solving advanced math, coding, and multimodal problems where simple prompting fails.

Concrete Example: In complex math problems, a model might reach a correct answer through incorrect reasoning (a false positive). Kimi k1.5 avoids this by enforcing long Chain-of-Thought verification and penalizing short, lucky guesses, unlike standard RL, which might exploit such rewards.

Key Novelty

Long-Context RL with Partial Rollouts

Treats the reasoning process as a single very long sequence (up to 128k context) rather than a tree, allowing the model to learn planning, reflection, and error correction implicitly within the context.
Uses 'partial rollouts' to reuse previous trajectory segments during training, enabling efficient training on extremely long reasoning paths without re-generating from scratch every step.
Simplifies RL by removing value networks and Monte Carlo Tree Search, relying instead on a robust variant of online mirror descent with length penalties.
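The partial-rollout idea described above can be illustrated with a toy sketch. It assumes a fixed per-iteration token budget and a buffer of unfinished trajectories that are resumed from their saved prefix; `Trajectory`, `partial_rollout`, and `toy_generate_step` are hypothetical names for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    prompt: str
    tokens: List[str] = field(default_factory=list)
    done: bool = False


def partial_rollout(
    buffer: List[Trajectory],
    generate_step: Callable[[str, List[str]], Optional[str]],
    budget: int,
) -> List[Trajectory]:
    """Advance each buffered trajectory by at most `budget` tokens.

    Finished trajectories are returned for training. Unfinished ones stay
    in `buffer`, so the next call resumes from the saved prefix instead of
    regenerating it.
    """
    finished, unfinished = [], []
    for traj in buffer:
        for _ in range(budget):
            token = generate_step(traj.prompt, traj.tokens)
            if token is None:  # hypothetical end-of-sequence signal
                traj.done = True
                break
            traj.tokens.append(token)
        (finished if traj.done else unfinished).append(traj)
    buffer[:] = unfinished
    return finished


def toy_generate_step(prompt: str, tokens: List[str]) -> Optional[str]:
    # Stand-in for the policy: emits len(prompt) tokens, then stops.
    if len(tokens) >= len(prompt):
        return None
    return f"t{len(tokens)}"


buffer = [Trajectory("ab"), Trajectory("abcdefg")]
first = partial_rollout(buffer, toy_generate_step, budget=4)
second = partial_rollout(buffer, toy_generate_step, budget=4)
```

In this toy run the short trajectory finishes in the first iteration, while the long one is cut off at the budget and completed in the second iteration from its stored prefix.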

Architecture

The RL training system architecture and the Partial Rollout mechanism.

Evaluation Highlights

Matches OpenAI o1 on reasoning benchmarks: 77.5 on AIME, 96.2 on MATH 500, and 94th percentile on Codeforces.
Achieves 74.9 on MathVista, demonstrating strong multimodal reasoning capabilities.
Short-CoT distilled model substantially outperforms GPT-4o and Claude 3.5 Sonnet on short-CoT benchmarks (e.g., 60.8 on AIME versus a 9.3 baseline, roughly a +550% relative improvement).

Breakthrough Assessment

9/10

Demonstrates that simple RL techniques scaled to massive contexts can match complex proprietary systems like o1, effectively demystifying 'reasoning' models and offering a reproducible recipe for scaling test-time compute.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning on reasoning tasks where the policy generates a chain-of-thought z and answer y given problem x.

Inputs: Problem x (text or multimodal)

Outputs: Long Chain-of-Thought z followed by final answer y

Pipeline Flow

Prompt Curation (Filtering & Difficulty tagging)
Long-CoT SFT (Warmup)
RL Training (Iterative Rollout & Optimization)
Long2Short Distillation (Optional for deployment)

System Modules

Policy Model

Generate long chain-of-thought and final answers

Model or implementation: Kimi k1.5 (Transformer-based Multimodal LLM)

Reward Model

Evaluate correctness of final answer y (and potentially reasoning process z)

Model or implementation: Fine-tuned Kimi model (Chain-of-Thought RM)

Rollout Worker

Generate trajectories using the current policy

Model or implementation: vLLM inference engine

Novel Architectural Elements

Partial Rollout mechanism: Decouples trajectory generation into segments, storing intermediate states in a replay buffer to enable training on 128k context lengths without full re-generation
Removal of Value Network: Sole reliance on policy gradients with outcome rewards and length penalties, discarding the standard Actor-Critic architecture for a simpler policy-only framework
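A policy-only update of the kind described above can be sketched without any critic. The summary does not say which baseline replaces the value network, so the group-mean baseline below is an assumption, and `mean_baseline_advantages` and `policy_gradient_loss` are hypothetical helper names.

```python
from typing import List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Subtract the mean reward of responses sampled for the same prompt."""
    if not rewards:
        return []
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def policy_gradient_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style surrogate: -mean(advantage * log_prob).

    Minimizing it raises the log-probability of responses that scored
    above the group mean and lowers it for those below.
    """
    if not rewards or len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must be non-empty and equal length")
    advantages = mean_baseline_advantages(rewards)
    weighted = [a * lp for a, lp in zip(advantages, log_probs)]
    return -sum(weighted) / len(weighted)


# Four sampled responses to one prompt: outcome reward 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
log_probs = [-1.0, -2.0, -2.0, -1.0]
advantages = mean_baseline_advantages(rewards)
loss = policy_gradient_loss(log_probs, rewards)
```

The baseline is computed from the sampled rewards themselves, so no separate value model has to be trained or served.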

Modeling

Base Model: Kimi k1.5 (Multimodal LLM)

Training Method: Online Policy Mirror Descent (variant)
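For intuition, a KL-regularized reward objective of the kind listed under Objective Functions has a well-known closed-form maximizer. The temperature \tau is an added assumption (the summary's formula corresponds to \tau = 1), and this is the standard textbook result rather than a transcription of the paper's own equations.

```latex
\max_{\pi}\;
\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
- \tau\, D_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\;\;\Longrightarrow\;\;
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\tau\big)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\!\big(r(x,y')/\tau\big)

% Taking logarithms of the optimum:
r(x,y) - \tau \log Z(x) = \tau \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The last identity relates rewards directly to log-probability ratios, which is one way to see why such an objective can be optimized without a learned value function.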

Objective Functions:

Purpose: Maximize expected reward while staying close to the reference policy.

Formally: minimize D_KL(pi || pi_ref) - E[r(x,y)]

Purpose: Penalize excessive length to prevent overthinking.

Formally: r_len = -lambda * (len(response) - min_len) / (max_len - min_len), applied when the response is correct
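The length penalty can be made concrete with a small sketch that follows the formula exactly as written in this summary; the paper's precise form may differ. Returning 0.0 for incorrect responses and for the degenerate case max_len == min_len is an assumption, as are the function name `length_penalty` and the default weight.

```python
def length_penalty(
    length: int,
    min_len: int,
    max_len: int,
    correct: bool,
    lam: float = 0.1,  # hypothetical weight; the summary lists it as adaptive
) -> float:
    """Length penalty for one response among those sampled for a prompt.

    min_len and max_len are the shortest and longest response lengths in
    the sampled group. Only correct responses are penalized here.
    """
    if not correct or max_len == min_len:
        return 0.0  # assumption: no penalty outside the stated case
    return -lam * (length - min_len) / (max_len - min_len)


# Three correct responses of different lengths, sampled for the same prompt.
lengths = [100, 300, 500]
penalties = [
    length_penalty(n, min(lengths), max(lengths), correct=True, lam=0.5)
    for n in lengths
]
```

Under this form the shortest correct response receives no penalty and the longest receives the full -lambda, so among correct answers the shorter ones are preferred.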

Adaptation: Full model update

Training Data:

RL Prompt Set: Filtered for diverse coverage, balanced difficulty, and accurate evaluability
SFT Data: ~1M text examples (500k QA, 200k coding, 200k math) + 1M text-vision examples

Key Hyperparameters:

context_window: 128k
sampling_temperature: Not explicitly reported in the paper
learning_rate: Decays from 2e-5 to 1e-6 (SFT stage)
length_penalty_weight: Adaptive (warmup)

Compute: Hybrid deployment system using Megatron for training and vLLM for inference, switching weights in <1 minute

Comparison to Prior Work

vs. OpenAI o1: Kimi k1.5 achieves matching performance using a 'simplistic' framework without MCTS or process reward models, relying purely on long-context scaling and policy optimization.
vs. AlphaZero: Does not use a value function or explicit tree search; implicit search happens in the linear token sequence.
vs. STaR [not cited in paper]: Uses online mirror descent and partial rollouts rather than simple iterative fine-tuning.

Limitations

Relies heavily on verifiable problems (math/code) where ground truth is available for rewards.
Computationally expensive inference due to extremely long generated chains of thought.
Verification of reasoning steps remains a challenge; prone to 'false positive' verification on simpler questions.

Reproducibility

No public code or weights provided. The paper describes the methodology, data recipes, and infrastructure in detail but does not release artifacts.

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot evaluation on standard reasoning benchmarks across Math, Code, and Vision-Language tasks.

Benchmarks:

AIME (Math Competition)
MATH 500 (Math Problem Solving)
Codeforces (Competitive Programming)
MathVista (Visual Math Reasoning)
LiveCodeBench (Code Generation)

Metrics:

Accuracy (Pass@1)
Percentile Rank
Statistical methodology: Not explicitly reported in the paper

Key Results

Kimi k1.5 (Long-CoT) matches or exceeds state-of-the-art models like OpenAI o1 on difficult reasoning benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
AIME	Accuracy	44.6	77.5	+32.9
MATH 500	Accuracy	85.5	96.2	+10.7
Codeforces	Percentile	89.0	94.0	+5.0
MathVista	Accuracy	73.9	74.9	+1.0

Long2Short distilled models significantly outperform standard Short-CoT models like GPT-4o.

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
AIME	Accuracy	9.3	60.8	+51.5
MATH 500	Accuracy	74.6	94.6	+20.0
LiveCodeBench	Accuracy	38.9	47.3	+8.4
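The absolute and relative gains for the Long2Short rows can be checked with a few lines of arithmetic. The numbers are copied from the table above; `improvement` is a hypothetical helper name.

```python
from typing import Tuple


def improvement(baseline: float, score: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = score - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# (benchmark, baseline, this paper) for the Long2Short rows.
short_cot_rows = [
    ("AIME", 9.3, 60.8),
    ("MATH 500", 74.6, 94.6),
    ("LiveCodeBench", 38.9, 47.3),
]

for name, baseline, score in short_cot_rows:
    absolute, relative_pct = improvement(baseline, score)
    print(f"{name}: +{absolute:.1f} points, +{relative_pct:.0f}% relative")

aime_abs, aime_rel = improvement(9.3, 60.8)
```

The AIME row is the source of the roughly +550% relative figure: a gain of 51.5 points on a 9.3 baseline.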

Experiment Figures

Performance comparison of Kimi k1.5 (Long/Short) vs OpenAI o1/GPT-4o/Claude 3.5 Sonnet across multiple benchmarks.

Main Takeaways

Scaling context length in RL allows the model to implicitly learn planning, reflection, and correction, removing the need for explicit search algorithms like MCTS.
The 'Partial Rollout' technique is critical for efficiency, enabling training on very long contexts by reusing trajectory segments.
Long-CoT capabilities can be effectively distilled into Short-CoT models (Long2Short), yielding large gains over standard models (up to roughly +550% relative on AIME, from 9.3 to 60.8) while keeping inference costs lower.
A simplistic RL framework without value functions or process reward models is sufficient to achieve SOTA reasoning performance if data quality and context scaling are handled correctly.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient, PPO)
Language Model Pretraining & SFT
Chain-of-Thought (CoT) Reasoning

Key Terms

Long-CoT: Long Chain-of-Thought—generating extremely detailed, step-by-step reasoning paths (sometimes thousands of tokens) to solve complex problems

Partial Rollout: A training technique where new trajectories are sampled by reusing large chunks of previous trajectories stored in a buffer, avoiding the cost of re-generating the full history

Online Mirror Descent: An optimization algorithm that updates policies by keeping them close to a reference distribution while maximizing rewards, often used for stable RL

SFT: Supervised Fine-Tuning—training a model on labeled examples before applying reinforcement learning

Reward Hacking: When an RL agent exploits loopholes in the reward function (e.g., guessing answers without reasoning) to maximize score without learning the intended task

Process Reward Model: A reward model that evaluates intermediate steps of reasoning rather than just the final outcome

Monte Carlo Tree Search (MCTS): A search algorithm used in decision processes to explore future states; often used in RL but replaced here by long-context implicit search

DPO: Direct Preference Optimization—a method to align models using preference pairs without an explicit reward model

Rejection Sampling: Generating multiple samples from a model and keeping only those that meet a correctness criterion

Value Function: A function estimating the expected future reward from a current state; explicitly removed in this paper's framework