Kimi k1.5: Scaling Reinforcement Learning with LLMs

📝 Paper Summary

Reinforcement Learning for LLMs | Reasoning Models | Long-Context LLMs

Kimi k1.5 scales reinforcement learning by optimizing long-context Chain-of-Thought generation, treating context length as a compute budget to improve reasoning without complex tree search or process rewards.

Core Problem

Standard pretraining is limited by the availability of static high-quality data, while prior RL attempts on LLMs have struggled to produce competitive results compared to supervised scaling.

Why it matters:

Continued scaling of AI intelligence solely through static data is hitting a ceiling.
Most existing RL approaches rely on complex, brittle components such as Monte Carlo Tree Search or dense process reward models, which are hard to scale.
Effective RL could allow models to self-improve by exploring and learning from their own successful reasoning paths.

Concrete Example: In complex math problems like AIME, standard models often fail because they lack the ability to 'plan' or 'backtrack'. Kimi k1.5 learns to generate long thought sequences (up to 128k context) that explicitly include trial, error, and correction steps, solving problems where single-pass generation fails.

Key Novelty

Simple Long-Context RL Framework

Scales RL context window to 128k tokens, allowing the model to treat 'thinking time' (token length) as a search budget.
Uses 'partial rollouts' to manage long trajectory generation efficiently: previously generated trajectory segments are reused instead of being regenerated from scratch.
Simplifies RL by removing value functions and process reward models (PRMs), relying instead on long-context policy optimization with sparse outcome rewards.

Architecture

The RL training system architecture focusing on the 'Partial Rollout' mechanism.

Evaluation Highlights

Matches OpenAI o1 performance on AIME (77.5) and MATH 500 (96.2).
Achieves 94th percentile on Codeforces, demonstrating strong coding capability.
long2short distillation substantially improves short-CoT models, which outperform GPT-4o and Claude Sonnet 3.5 by large margins (up to +550% relative gain).

Breakthrough Assessment

9/10

Demonstrates that simple RL techniques scaled to massive context lengths can match complex proprietary systems like OpenAI o1, establishing a viable alternative to process-reward-heavy approaches.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from verifiable outcome rewards (e.g., correct answer, passed tests).

Inputs: Problem statement x (text or multimodal)

Outputs: Long chain-of-thought z and final answer y

Pipeline Flow

Prompt Curation (Difficulty filtering)
Long-CoT SFT (Warmup)
RL Training (Iterative Rollout & Optimization)
Inference (Long-CoT Generation)

System Modules

Prompt Curator

Selects diverse, verifiable prompts and filters out 'easy-to-hack' questions.
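
A minimal sketch of how an 'easy-to-hack' filter could work, assuming it asks a model to guess the answer directly (without reasoning) several times and drops any prompt it guesses correctly. The callable `guess_answer` and both function names are hypothetical; the default of 8 attempts comes from the Key Hyperparameters in this summary.

```python
from typing import Callable, List, Tuple


def is_easy_to_hack(prompt: str, reference: str,
                    guess_answer: Callable[[str], str],
                    n_attempts: int = 8) -> bool:
    """True if a direct guess matches the reference within n_attempts tries.

    guess_answer is a hypothetical stand-in for a model call that returns
    an answer without any chain-of-thought.
    """
    for _ in range(n_attempts):
        if guess_answer(prompt).strip() == reference.strip():
            return True
    return False


def filter_prompts(dataset: List[Tuple[str, str]],
                   guess_answer: Callable[[str], str],
                   n_attempts: int = 8) -> List[Tuple[str, str]]:
    """Keep only (prompt, reference) pairs that cannot be guessed directly."""
    return [(p, ref) for p, ref in dataset
            if not is_easy_to_hack(p, ref, guess_answer, n_attempts)]
```

Multiple-choice or true/false questions are the typical casualties of such a filter, since a lucky guess can earn full reward without any valid reasoning.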

Policy Model (Actor)

Generates long CoT trajectories.

Model or implementation: Kimi k1.5 (Multimodal LLM)

Reward Model

Evaluates final answer correctness.

Model or implementation: Chain-of-Thought RM
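
For intuition, a sparse outcome reward can be sketched as a binary check on the final answer only. This is a simplified rule-based stand-in, not the learned Chain-of-Thought RM named above; the normalization rules and function names are assumptions for illustration.

```python
def normalize(answer: str) -> str:
    """Lowercase, drop all whitespace, and strip trailing periods."""
    return "".join(answer.lower().split()).rstrip(".")


def outcome_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answers match after normalization, else 0.0.

    Only the final answer is scored; intermediate reasoning steps
    receive no reward of their own.
    """
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```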

Novel Architectural Elements

Partial Rollout System: Decouples trajectory generation from monolithic forward passes, allowing extremely long sequences (128k) to be generated and trained on asynchronously without OOM errors.
Hybrid Deployment: Collocates training (Megatron) and inference (vLLM) on the same GPUs, switching between them to maximize utilization.
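
The partial rollout idea can be illustrated with a toy scheduler: each iteration extends every unfinished trajectory by at most a fixed token budget, and anything still unfinished is carried over to the next iteration rather than restarted. This is a sketch under stated assumptions; `Trajectory`, `rollout_iteration`, and `step_fn` are hypothetical names, and a real system would also preserve inference state such as the KV cache.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    prompt: str
    tokens: List[str] = field(default_factory=list)
    finished: bool = False


def rollout_iteration(
    active: List[Trajectory],
    step_fn: Callable[[Trajectory], Tuple[str, bool]],
    budget: int,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Extend each trajectory by at most `budget` tokens.

    step_fn is a hypothetical single-token generator returning
    (next_token, is_done). Returns (completed, pending); pending
    trajectories keep their tokens and resume in a later iteration.
    """
    completed, pending = [], []
    for traj in active:
        for _ in range(budget):
            token, done = step_fn(traj)
            traj.tokens.append(token)
            if done:
                traj.finished = True
                break
        (completed if traj.finished else pending).append(traj)
    return completed, pending


def toy_step(traj: Trajectory) -> Tuple[str, bool]:
    """Toy generator: a trajectory ends once it reaches int(prompt) tokens."""
    return "tok", len(traj.tokens) + 1 >= int(traj.prompt)


batch = [Trajectory("3"), Trajectory("7")]
done_1, pending_1 = rollout_iteration(batch, toy_step, budget=4)
done_2, pending_2 = rollout_iteration(pending_1, toy_step, budget=4)
```

In the toy run, the short trajectory completes in the first iteration, while the long one is paused after 4 tokens and finished in the second iteration without regenerating its prefix.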

Modeling

Base Model: Kimi k1.5 (Multimodal LLM)

Training Method: Online Policy Mirror Descent (variant of Policy Gradient)
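
As a rough illustration of a mirror-descent-style policy update, the sketch below computes a scalar surrogate loss for k sampled responses to one prompt: a policy-gradient term weighted by reward minus the sample mean, plus a squared log-ratio penalty that keeps the new policy close to the previous one. The mean-reward baseline, the penalty form, and all names are assumptions for illustration; the paper's exact estimator may differ. Real training would use autograd tensors rather than plain floats.

```python
from statistics import mean
from typing import Sequence


def surrogate_loss(logp_new: Sequence[float],
                   logp_old: Sequence[float],
                   rewards: Sequence[float],
                   tau: float = 0.1) -> float:
    """Scalar surrogate loss over k sampled responses for one prompt.

    logp_new / logp_old: log-probabilities of each sampled response under
    the current and previous policy. tau controls how strongly the new
    policy is regularized toward the previous one.
    """
    baseline = mean(rewards)  # sample-mean reward as a simple baseline
    policy_term = mean([(r - baseline) * lp
                        for r, lp in zip(rewards, logp_new)])
    reg_term = mean([(ln - lo) ** 2
                     for ln, lo in zip(logp_new, logp_old)])
    # Minimizing this raises log-probability of above-average responses
    # while penalizing drift from the previous policy.
    return -policy_term + 0.5 * tau * reg_term
```

Note that no value network appears anywhere: the baseline is computed from the sampled rewards themselves.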

Objective Functions:

Purpose: Optimize policy to maximize expected reward while staying close to reference.

Formally: the update follows the gradient of a surrogate loss L(θ) = Σ_j w_j · log π_θ(y_j, z_j | x), where the weight w_j depends on exp(r_j/τ).

Purpose: Penalize excessive token length to prevent 'overthinking'.

Formally: Length reward r_len based on relative length of correct responses.
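
One way to realize a length reward based on relative length is sketched below. Among the responses sampled for a single prompt, the shortest gets +0.5, the longest gets -0.5, and incorrect responses never receive a positive bonus. The linear scale, the specific constants, and the function name are assumptions for illustration.

```python
from typing import List, Sequence


def length_rewards(lengths: Sequence[int],
                   correct: Sequence[bool]) -> List[float]:
    """Length reward for each sampled response to one prompt.

    lengths: token count of each response.
    correct: whether each response reached the right final answer.
    """
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        # All responses have the same length: nothing to prefer.
        return [0.0] * len(lengths)
    rewards = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / (hi - lo)  # +0.5 (shortest) to -0.5 (longest)
        # Correct answers get the full signal; wrong answers are only
        # ever penalized, so being short and wrong earns nothing.
        rewards.append(lam if ok else min(0.0, lam))
    return rewards
```

For example, `length_rewards([100, 200, 300], [True, True, False])` gives `[0.5, 0.0, -0.5]`.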

Training Data:

Prompt set filtered for difficulty and verifiability.
Warmup SFT data: long-CoT examples generated through prompt engineering in a rejection-sampling style.

Key Hyperparameters:

context_length: 128k
sampling_temperature: High (for difficulty estimation)
N_attempts: 8 (for hack-prevention)
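
Difficulty estimation with high-temperature sampling can be sketched as a pass-rate computation: sample several answers per prompt and treat a low pass rate as high difficulty. The sample count, the temperature value, and the `sample_answer` callable are assumptions for illustration and are not taken from the hyperparameters above.

```python
from typing import Callable


def estimate_difficulty(prompt: str, reference: str,
                        sample_answer: Callable[[str, float], str],
                        n_samples: int = 10,
                        temperature: float = 1.0) -> float:
    """Return 1 - pass rate over n_samples high-temperature samples.

    sample_answer(prompt, temperature) is a hypothetical model call that
    returns one final answer. 0.0 means always solved, 1.0 never solved.
    """
    passes = sum(sample_answer(prompt, temperature) == reference
                 for _ in range(n_samples))
    return 1.0 - passes / n_samples
```

Prompts scoring near 0.0 offer little learning signal, so a curation step might drop them or down-weight them during RL.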

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenAI o1: Achieves matching performance using a simpler framework without Monte Carlo Tree Search or value functions.
vs. DeepSeek-R1 [not cited in paper]: Similar focus on long-CoT via RL, but Kimi emphasizes 'Partial Rollout' infrastructure and specific 'long2short' distillation techniques.

Limitations

Reliance on verifiable ground truth limits applicability to open-ended tasks.
Long-CoT inference is computationally expensive (high token cost).
Risk of 'overthinking' where the model generates unnecessary tokens if not penalized.

Reproducibility

No public code or weights. Detailed recipes for data curation (difficulty filtering, hack prevention) and system infrastructure (partial rollouts, hybrid deployment) are provided.

📊 Experiments & Results

Evaluation Setup

Evaluation on diverse reasoning benchmarks across Math, Code, and Vision-Language tasks.

Benchmarks:

AIME (Math Competition)
MATH 500 (Mathematics Problem Solving)
Codeforces (Competitive Programming)
MathVista (Visual Math Reasoning)

Metrics:

Accuracy (Pass@1)
Percentile Rank

Statistical methodology: Not explicitly reported in the paper

Key Results

Kimi k1.5 (Long-CoT) is competitive with OpenAI o1 on difficult reasoning benchmarks: within 2 points on AIME and MATH 500, and ahead on Codeforces and MathVista.
Benchmark	Metric	Baseline	This Paper	Δ
AIME	Accuracy	79.2	77.5	-1.7
MATH 500	Accuracy	96.4	96.2	-0.2
Codeforces	Percentile	89.0	94.0	+5.0
MathVista	Accuracy	63.8	74.9	+11.1
long2short distilled models significantly outperform standard short-CoT baselines.
Benchmark	Metric	Baseline	This Paper	Δ
AIME	Accuracy	9.3	60.8	+51.5
LiveCodeBench	Accuracy	38.9	47.3	+8.4
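
The Δ column reports absolute point differences, whereas the Evaluation Highlights quote a relative gain. The short calculation below converts the short-CoT rows of the table to both forms. Linking the roughly +550% figure to the AIME row is an inference from this arithmetic, not something the summary states.

```python
from typing import Tuple

# (benchmark, baseline, this paper), copied from the short-CoT rows above
rows = [("AIME", 9.3, 60.8), ("LiveCodeBench", 38.9, 47.3)]


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute delta in points, relative gain in percent)."""
    delta = ours - baseline
    return round(delta, 1), round(100 * delta / baseline, 1)


for name, baseline, ours in rows:
    delta, rel = gains(baseline, ours)
    print(f"{name}: {delta:+.1f} points, {rel:+.1f}% relative")
# AIME: +51.5 points, +553.8% relative
# LiveCodeBench: +8.4 points, +21.6% relative
```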

Experiment Figures

Radar charts comparing Kimi k1.5 (Long-CoT and Short-CoT) against OpenAI o1 and GPT-4o across 6 benchmarks.

Main Takeaways

Scaling context length in RL is a viable alternative to complex planning algorithms; the model learns implicit planning within the token sequence.
Partial rollouts are essential infrastructure for training on long trajectories (up to 128k) efficiently.
The 'long2short' techniques (model merging, shortest rejection sampling, DPO) effectively transfer reasoning power to cheaper models.
Simple outcome-based rewards are sufficient for learning complex reasoning behaviors if the prompt set is high-quality and diverse.
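
Of the long2short techniques listed above, shortest rejection sampling is the easiest to sketch: sample several responses to the same prompt, discard the incorrect ones, and keep the shortest correct one as a training target. Measuring length in characters and the function name are assumptions for illustration; a real pipeline would count tokens.

```python
from typing import Callable, Optional, Sequence


def shortest_correct(samples: Sequence[str],
                     is_correct: Callable[[str], bool]) -> Optional[str]:
    """Return the shortest sample that passes is_correct, or None.

    is_correct is a hypothetical verifier, e.g. a final-answer check.
    Length is measured in characters here for simplicity.
    """
    correct = [s for s in samples if is_correct(s)]
    return min(correct, key=len) if correct else None
```

Selected responses could then serve as supervised fine-tuning targets for a short-CoT model, or as the preferred side of a preference pair when paired with a longer response.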

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradient)
Chain-of-Thought (CoT) Prompting
Large Language Model Architecture

Key Terms

Long-CoT: Chain-of-Thought reasoning that extends to thousands of tokens, exhibiting planning, reflection, and error correction.

Partial Rollouts: A system technique where long sequences are generated in chunks; if a sequence exceeds a budget, it is paused and resumed later, reusing the memory state.

Process Reward Model (PRM): A reward model that evaluates each step of reasoning; Kimi k1.5 avoids this in favor of outcome-based rewards.

Online Policy Mirror Descent: The optimization algorithm used here for policy updates; it regularizes the new policy to stay close to the previous one.

long2short: Techniques to distill the reasoning capabilities of a computationally expensive long-CoT model into a more efficient short-CoT model.