Evaluation Setup
The model is evaluated on reasoning benchmarks spanning math, code, and vision-language tasks.
Benchmarks:
- AIME (Math Competition)
- MATH 500 (Mathematics Problem Solving)
- Codeforces (Competitive Programming)
- LiveCodeBench (Code Generation)
- MathVista (Visual Math Reasoning)
Metrics:
- Accuracy (Pass@1)
- Percentile Rank
- Statistical methodology: Not explicitly reported in the paper
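Because the statistical methodology is not reported, the exact scoring protocol is unknown. As a minimal sketch of the two metric types listed above, assuming Pass@1 is the mean per-problem fraction of correct samples and percentile rank is the share of a reference population scoring strictly lower, the computation could look as follows. The function names `pass_at_1` and `percentile_rank` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def pass_at_1(correct_per_problem: Sequence[Sequence[bool]]) -> float:
    """Pass@1 in percent: mean over problems of the fraction of correct samples.

    With a single sample per problem this reduces to plain accuracy.
    """
    if not correct_per_problem:
        raise ValueError("need at least one problem")
    per_problem = []
    for samples in correct_per_problem:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        per_problem.append(sum(samples) / len(samples))
    return 100.0 * sum(per_problem) / len(per_problem)


def percentile_rank(score: float, population: Sequence[float]) -> float:
    """Percent of the population scoring strictly below `score` (assumed definition)."""
    if not population:
        raise ValueError("population must be non-empty")
    below = sum(1 for s in population if s < score)
    return 100.0 * below / len(population)


if __name__ == "__main__":
    # Two problems, two samples each: one half solved, one fully solved.
    print(pass_at_1([[True, False], [True, True]]))
    # A rating of 1500 against four hypothetical competitor ratings.
    print(percentile_rank(1500, [1000, 1200, 1400, 1600]))
```

Averaging several samples per problem lowers the variance of the estimate; whether the reported numbers use one sample or many is not stated in this summary.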
Key Results
Kimi k1.5 (Long-CoT) achieves state-of-the-art results on difficult reasoning benchmarks, matching or exceeding OpenAI o1.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME | Accuracy | 79.2 | 77.5 | -1.7 |
| MATH 500 | Accuracy | 96.4 | 96.2 | -0.2 |
| Codeforces | Percentile | 89.0 | 94.0 | +5.0 |
| MathVista | Accuracy | 63.8 | 74.9 | +11.1 |

long2short distilled models significantly outperform standard short-CoT baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME | Accuracy | 9.3 | 60.8 | +51.5 |
| LiveCodeBench | Accuracy | 38.9 | 47.3 | +8.4 |
Main Takeaways
- Scaling context length in RL is a viable alternative to complex planning algorithms; the model learns implicit planning within the token sequence.
- Partial rollouts are essential infrastructure for training efficiently on long trajectories (up to 128k tokens).
- The 'long2short' techniques (model merging, shortest rejection sampling, DPO) effectively transfer reasoning power to cheaper models.
- Simple outcome-based rewards are sufficient for learning complex reasoning behaviors if the prompt set is high-quality and diverse.
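Two of the takeaways above, outcome-based rewards and shortest rejection sampling, can be illustrated together. The following is a minimal sketch, assuming the reward is a binary exact match on the final answer and that shortest rejection sampling keeps the shortest correct response among several samples for the same prompt. The `Sample` type and the functions `outcome_reward` and `shortest_correct` are hypothetical names for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    text: str        # full response, including the reasoning trace
    answer: str      # final answer extracted from the response
    num_tokens: int  # response length in tokens


def outcome_reward(sample: Sample, reference: str) -> int:
    """Binary outcome reward: 1 if the final answer matches the reference, else 0.

    Only the final answer is scored; intermediate reasoning steps are ignored.
    """
    return int(sample.answer.strip() == reference.strip())


def shortest_correct(samples: Sequence[Sample], reference: str) -> Optional[Sample]:
    """Return the shortest sample with reward 1, or None if none is correct."""
    correct = [s for s in samples if outcome_reward(s, reference) == 1]
    if not correct:
        return None
    return min(correct, key=lambda s: s.num_tokens)


if __name__ == "__main__":
    candidates = [
        Sample("long derivation ...", "42", 120),
        Sample("short derivation ...", "42", 80),
        Sample("wrong but brief ...", "41", 10),
    ]
    best = shortest_correct(candidates, "42")
    print(best.num_tokens if best else None)
```

Filtering by correctness before minimizing length matters: picking the shortest response overall would favor brief wrong answers, as the third candidate shows.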