MiMo: Unlocking the Reasoning Potential of Language Model--From Pretraining to Posttraining

📝 Paper Summary

Reasoning LLMs · Reinforcement Learning for Reasoning · Data-Centric AI

MiMo-7B maximizes reasoning potential in a 7B model through a reasoning-dense pre-training data mixture and a reinforcement learning pipeline using verifiable math/code problems with difficulty-weighted rewards.

Core Problem

Most current reasoning models rely on large parameter counts (e.g., 32B+) to achieve strong performance in math and code, and standard pre-training pipelines often filter out or degrade reasoning-dense content such as raw code and LaTeX.

Why it matters:

Small models (7B) are often considered too weak to simultaneously master math and code reasoning compared to larger counterparts
Standard heuristic filters in pre-training pipelines inadvertently remove high-value reasoning data (e.g., complex math pages)
Sparse rewards in RL for reasoning (getting a hard problem right) make it difficult to train robust policies without dense signals

Concrete Example: Common pre-training extractors often fail to preserve LaTeX equations or code blocks from web pages, turning a rich math tutorial into broken text. MiMo-7B uses a custom HTML extractor to preserve these, ensuring the model sees valid reasoning patterns.
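
The idea in this example can be sketched with Python's standard-library `html.parser`. This is not MiMo's extractor: the class name `ReasoningPreservingExtractor`, the `$...$` wrapping, and the choice to read LaTeX from MathML `<annotation encoding="application/x-tex">` elements are assumptions made purely for illustration.

```python
from html.parser import HTMLParser


class ReasoningPreservingExtractor(HTMLParser):
    """Toy extractor (hypothetical): keeps <pre> code and LaTeX annotations."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # extracted fragments, in document order
        self.in_pre = False   # inside <pre>: keep whitespace verbatim
        self.in_tex = False   # inside a LaTeX <annotation> element
        self.math_depth = 0   # inside <math>: drop the rendered glyphs
        self.skip_depth = 0   # inside <script>/<style>: drop everything

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "math":
            self.math_depth += 1
        elif tag == "pre":
            self.in_pre = True
        elif tag == "annotation":
            self.in_tex = dict(attrs).get("encoding") == "application/x-tex"

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "math":
            self.math_depth = max(0, self.math_depth - 1)
        elif tag == "pre":
            self.in_pre = False
        elif tag == "annotation":
            self.in_tex = False

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.in_tex:
            self.parts.append("$" + data.strip() + "$")   # keep the LaTeX source
        elif self.math_depth:
            return                                         # skip duplicate rendering
        elif self.in_pre:
            self.parts.append("\n" + data + "\n")          # keep code indentation
        elif data.strip():
            self.parts.append(" ".join(data.split()))      # normalize prose whitespace


def extract(html: str) -> str:
    """Return page text with code blocks and LaTeX formulas left intact."""
    parser = ReasoningPreservingExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A generic text extractor would typically flatten the `<pre>` indentation and emit only the rendered math glyphs; the sketch keeps both the code layout and the formula source.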

Key Novelty

Full-Stack Reasoning Optimization (Pre-training + Post-training RL)

Enhances pre-training density by using custom extractors for math/code and a 3-stage data mixture that progressively focuses on STEM content (up to ~70%)
Incorporates Multi-Token Prediction (MTP) during pre-training to encourage planning and accelerate inference via speculative decoding
Applies a 'test difficulty driven code reward' in RL, assigning fine-grained scores based on which test cases pass, providing denser signals than simple pass/fail
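
One way a difficulty-aware code reward could be computed is sketched below. The grouping of tests into difficulty buckets, the bucket scores, and the function names `soft_reward` and `strict_reward` are illustrative assumptions, not the paper's exact scheme. The point is only that a partially correct solution earns a graded score instead of a single pass/fail bit.

```python
from typing import Dict, List


def soft_reward(passed: Dict[str, bool],
                groups: List[List[str]],
                group_scores: List[float]) -> float:
    """Split each group's score evenly over its tests; sum the passed shares."""
    total = 0.0
    for tests, score in zip(groups, group_scores):
        per_test = score / len(tests)
        total += sum(per_test for t in tests if passed.get(t, False))
    return total / sum(group_scores)  # normalize to [0, 1]


def strict_reward(passed: Dict[str, bool],
                  groups: List[List[str]],
                  group_scores: List[float]) -> float:
    """Credit a group only if it and every easier group pass completely."""
    total = 0.0
    for tests, score in zip(groups, group_scores):  # ordered easiest to hardest
        if not all(passed.get(t, False) for t in tests):
            break
        total += score
    return total / sum(group_scores)


# Hypothetical problem: three difficulty buckets, harder buckets worth more.
groups = [["t1", "t2"], ["t3"], ["t4", "t5"]]
group_scores = [1.0, 2.0, 4.0]
passed = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False}

soft = soft_reward(passed, groups, group_scores)
strict = strict_reward(passed, groups, group_scores)
```

With a plain pass/fail reward this rollout would score 0 because one test fails; both variants above still return a positive signal.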

Architecture

Comparison of MTP (Multi-Token Prediction) setup during pre-training vs. inference.

Evaluation Highlights

MiMo-7B-RL scores 55.4 on AIME 2025, outperforming OpenAI o1-mini by 4.7 points
MiMo-7B-Base scores 32.9 on LiveCodeBench v5 (Pass@1), significantly outperforming Llama-3.1-8B and Qwen-2.5-7B
RL training on the 7B base model (MiMo-7B-RL-Zero) surpasses the RL performance of a 32B base model on both math and code tasks

Breakthrough Assessment

8/10

Demonstrates that strong reasoning (beating o1-mini) is possible at 7B scale with rigorous data engineering and verifiable RL, challenging the belief that such capabilities require 32B+ parameters.

⚙️ Technical Details

Problem Definition

Setting: LLM pre-training followed by Reinforcement Learning on verifiable reasoning tasks

Inputs: Natural language queries, specifically math and programming problems

Outputs: Step-by-step reasoning chains and final answers (code or math solutions)

Pipeline Flow

Group: Pre-training Pipeline
HTML Extraction & Filtering → 3-Stage Data Mixing → Base Model Training (with MTP)
Group: Post-training Pipeline
SFT (Cold Start) → RL with Verifiable Problems → Seamless Rollout Engine

System Modules

Base Model (Inference / Generation)

Core language model generating reasoning paths

Model or implementation: 7B Transformer (36 layers, 4096 hidden dim)

MTP Heads (Inference / Generation)

Predict multiple future tokens for better planning and speculative decoding speedup

Model or implementation: Single layer during pre-training; replicated to 2 layers for inference

Novel Architectural Elements

Multi-Token Prediction (MTP) integration: Single MTP layer during pre-training, expanded to two parallel MTP layers during inference for speculative decoding
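
The draft-and-verify loop behind speculative decoding can be sketched with toy stand-ins. Here `draft_next` plays the role of the MTP layers and `target_next` the main model; both functions, the greedy acceptance rule, and `num_draft=2` are assumptions for illustration rather than MiMo's implementation.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def speculative_decode(target_next: NextFn, draft_next: NextFn,
                       prompt: List[int], max_new_tokens: int,
                       num_draft: int = 2) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1) Draft: cheaply propose num_draft tokens.
        draft: List[int] = []
        for _ in range(num_draft):
            draft.append(draft_next(out + draft))
        # 2) Verify: a real system scores all drafted positions in one
        #    forward pass of the main model; here we call it per position.
        rounds += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            # Every draft matched: the verify pass yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], rounds


def greedy_decode(target_next: NextFn, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


# Toy stand-ins (hypothetical): the drafter agrees with the target
# except after token 5, where it guesses wrong.
def target_next(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def draft_next(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else target_next(seq)


spec_tokens, verify_rounds = speculative_decode(target_next, draft_next, [1], 8)
greedy_tokens = greedy_decode(target_next, [1], 8)
```

Because every accepted token is checked against the main model, the output matches plain greedy decoding; the saving comes from needing fewer verification rounds than generated tokens.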

Modeling

Base Model: MiMo-7B (Transformer, 36 layers, 4096 hidden dim, 32 heads, 8 KV groups)
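
As a worked example of what these dimensions imply, the sketch below estimates the KV-cache footprint per token. It assumes `head_dim = hidden_dim / num_heads` and 16-bit cache values; neither is stated in this summary, and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


hidden_dim, num_heads, num_kv_groups, num_layers = 4096, 32, 8, 36
head_dim = hidden_dim // num_heads  # assumption: hidden_dim / num_heads

gqa_bytes = kv_cache_bytes_per_token(num_layers, num_kv_groups, head_dim)
mha_bytes = kv_cache_bytes_per_token(num_layers, num_heads, head_dim)
savings = mha_bytes / gqa_bytes  # cache-size ratio of full MHA to 8 KV groups
```

Under these assumptions, sharing key/value heads across 8 groups shrinks the cache by the ratio of query heads to KV groups, which matters most at the 32,768-token context length.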

Training Method: Reinforcement Learning (the paper describes a modified version of Group Relative Policy Optimization, GRPO)

Objective Functions:

Purpose: Pre-training loss.

Formally: Standard next-token prediction loss + MTP loss (weight 0.3 then 0.1).

Purpose: RL Reward.

Formally: Rule-based accuracy reward (1 for correct, 0 for incorrect).

Purpose: Code RL Reward.

Formally: Test difficulty driven reward based on fine-grained scores of passing test cases.
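
The first two objectives can be written out as follows. The notation is an assumption introduced here, not taken from the paper: $T$ is the sequence length, $p_\theta$ the main model, $p_\theta^{\text{MTP}}$ the single MTP layer (assumed to predict the token two positions ahead), and $\hat{y}$, $y^\ast$ the predicted and reference answers.

```latex
\mathcal{L}_{\text{NTP}} = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p_\theta\left(x_{t+1} \mid x_{\le t}\right),
\qquad
\mathcal{L}_{\text{MTP}} = -\frac{1}{T-2} \sum_{t=1}^{T-2} \log p_\theta^{\text{MTP}}\left(x_{t+2} \mid x_{\le t}\right)

\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{NTP}} + \lambda_{\text{MTP}} \, \mathcal{L}_{\text{MTP}},
\qquad
\lambda_{\text{MTP}} = 0.3 \text{ initially, then } 0.1

r_{\text{acc}}(\hat{y}, y^\ast) =
\begin{cases}
1 & \text{if } \hat{y} \text{ is verified as matching } y^\ast \\
0 & \text{otherwise}
\end{cases}
```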

Adaptation: Full model training

Trainable Parameters: 7B

Training Data:

Pre-training: 25 trillion tokens total
RL: 130K verifiable math and code problems

Key Hyperparameters:

learning_rate: 3e-4 to 1e-5 (decay schedule)
batch_size: Up to 2560 (Stages 1-2), 640 (Stage 3)
context_length: 8,192 (Stages 1-2), 32,768 (Stage 3)
weight_decay: 0.1
mtp_loss_weight: 0.3 (initial) -> 0.1 (later)
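
A quick arithmetic check on these settings, assuming `batch_size` counts sequences and every sequence is packed to the full context length (both assumptions, as is reading "up to 2560" as the value in effect): the batch shrinks by the same factor that the context grows.

```python
def tokens_per_batch(batch_size: int, context_length: int) -> int:
    """Tokens processed per optimizer step, assuming fully packed sequences."""
    return batch_size * context_length


stage_1_2_tokens = tokens_per_batch(2560, 8192)   # Stages 1-2
stage_3_tokens = tokens_per_batch(640, 32768)     # Stage 3
```

Under these assumptions the two products are equal, which suggests the Stage 3 change extends context length without changing the number of tokens seen per step.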

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenAI o1-mini: MiMo-7B-RL outperforms on AIME 2025 (55.4 vs. 50.7; the o1-mini score is implied by the reported 4.7-point margin)
vs. Llama-3.1-8B: MiMo-7B-Base uses 3-stage data mixing with ~70% STEM content in stage 2, significantly boosting math/code scores
vs. DeepSeek-V3 [not cited in paper]: Both use MTP, but MiMo applies it at 7B scale specifically to enhance reasoning planning and inference speed

Limitations

RL training relies heavily on verifiable problems (math/code), which may not transfer perfectly to open-ended creative reasoning
Requires massive pre-training data (25T tokens) which is computationally expensive to replicate
MTP benefits saturate after one layer during pre-training (though inference uses two)

Reproducibility

Code: https://github.com/xiaomimimo/MiMo

Available: Model checkpoints (Base, SFT, RL) and code at https://github.com/xiaomimimo/MiMo. Missing: Exact RL hyperparameters (clipping range, KL coefficient, etc.) and a link to the 130K-problem RL dataset.

📊 Experiments & Results

Evaluation Setup

Comprehensive benchmarking on Reasoning, Math, Code, and General Knowledge

Benchmarks:

AIME 2024/2025 (Hard Math Reasoning)
LiveCodeBench (Code Generation)
BBH (General Reasoning)
SuperGPQA (Graduate-Level Science QA)

Metrics:

Pass@1
Pass@k

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Base model comparisons show MiMo-7B-Base outperforming other 7B-class models on reasoning-heavy tasks.
BBH	Score	70.2	75.2	+5.0
AIME 2024	Pass@1	Not reported in the paper	32.9	Not reported in the paper
LiveCodeBench v5	Pass@1	Not reported in the paper	32.9	Not reported in the paper
RL-tuned model comparisons show MiMo-7B-RL ahead of the proprietary OpenAI o1-mini on AIME 2025.
AIME 2025	Score	50.7	55.4	+4.7

Experiment Figures

Pass@k performance curves on AIME 2024 and LiveCodeBench for MiMo-7B-Base vs baselines (Llama-3.1-8B, Qwen2.5-7B, 32B baseline).

Long-context performance on Needle-In-A-Haystack (NIAH) and reasoning tasks (Variable Tracking, etc.).

Main Takeaways

MiMo-7B-Base establishes a new SOTA for ~7B base models on reasoning tasks (Math, Code, BBH), validating the reasoning-focused data mixture strategy.
RL training is highly effective even at 7B scale: MiMo-7B-RL outperforms OpenAI o1-mini on AIME 2025, challenging the assumption that only large models benefit significantly from reasoning RL.
The 'Seamless Rollout Engine' and MTP integration enable efficient training and inference, addressing the computational bottleneck of RL reasoning (which requires many samples).

📚 Prerequisite Knowledge

Prerequisites

Language Model Pre-training pipelines
Reinforcement Learning (RL) for LLMs
Speculative Decoding

Key Terms

MTP: Multi-Token Prediction—a training objective where the model predicts multiple future tokens at once, encouraging planning and enabling faster inference

Pass@k: A metric that considers a problem solved if at least one correct solution is found among k generated samples
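
In practice Pass@k is usually estimated from n ≥ k samples per problem rather than by drawing exactly k. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples; whether the paper computes its Pass@k curves this way is not stated in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets containing no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to the plain fraction of correct samples, c / n.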

Speculative Decoding: An inference technique where a small model (or MTP head) drafts tokens quickly, which are then verified by the main model

RL: Reinforcement Learning—training a model by rewarding desired behaviors (correct answers) rather than just imitating data

SFT: Supervised Fine-Tuning—training on labeled examples to teach the model instruction following before RL

MinHash: A technique for estimating the Jaccard similarity between sets (for example, documents represented as sets of word n-grams) to detect and remove near-duplicates
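
A minimal MinHash sketch using only the standard library follows. The shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are illustrative choices, not details from the paper.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Set of word n-grams representing a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: Set[str], num_hashes: int = 128) -> List[int]:
    """For each salted hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "gradient descent updates parameters using the loss derivative each step"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_duplicate = estimate_jaccard(sig_a, sig_b)
unrelated = estimate_jaccard(sig_a, sig_c)
```

A deduplication pipeline would then drop one document from any pair whose estimate exceeds a chosen threshold.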

RoPE: Rotary Positional Embedding—a method for encoding position information in Transformer models, allowing better extrapolation to longer sequences
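
The defining property of RoPE is that the dot product between a rotated query and a rotated key depends only on their relative offset. A pure-Python sketch follows; the interleaved pairing of dimensions and `base=10000.0` are common conventions assumed here, not details confirmed for MiMo.

```python
import math
from typing import List


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-scaled angles."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 13), rope(k, 10))
```

The two scores agree up to floating-point error because each pair is rotated by an angle linear in position, so only the position difference survives in the dot product.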

GQA: Grouped-Query Attention—an efficiency technique in Transformers where multiple query heads share key/value heads to save memory

SwiGLU: A gated feed-forward activation used in modern LLMs that multiplies a Swish (SiLU)-activated linear projection elementwise with a second linear projection

vLLM: A high-throughput library for LLM inference and serving

NIAH: Needle-In-A-Haystack—a test of long-context capability where a specific fact is hidden in a large amount of unrelated text