L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

📝 Paper Summary

Reasoning Language Models, Test-time Compute Scaling, Reinforcement Learning

Length Controlled Policy Optimization (LCPO) trains reasoning models to strictly adhere to user-specified chain-of-thought length constraints while maximizing accuracy, enabling precise control over test-time compute.

Core Problem

Current reasoning models (like O1, R1) generate variable-length chains-of-thought without user control, making it impossible to allocate specific test-time compute budgets or prevent wasteful overthinking.

Why it matters:

Uncontrolled models may generate tens of thousands of tokens unnecessarily, wasting substantial compute resources.
Existing solutions like 'budget-forcing' (S1) simply truncate generation or insert stop tokens, which interrupts reasoning and severely degrades performance.
Users cannot currently calibrate inference costs vs. accuracy for real-time applications where latency or cost ceilings are critical.

Concrete Example: When an S1 model reaches its token limit, it inserts a 'Final Answer' token, often forcing the model to guess before it has solved the problem. In contrast, given a requested budget of 512 tokens, L1 adapts its strategy to solve the problem within that budget.

Key Novelty

Length Controlled Policy Optimization (LCPO)

Condition the model on a specific target length (e.g., "Think for 512 tokens.") explicitly in the prompt during both training and inference.
Train using Reinforcement Learning (GRPO) with a reward function that penalizes deviations from the target length while rewarding correct answers.
This incentivizes the model to learn how to compress or expand its reasoning steps dynamically to fit the budget, rather than just truncating output.

Architecture

Conceptual comparison of standard reasoning (uncontrolled), S1 (truncated/forced), and L1 (adaptive length control).

Evaluation Highlights

+100% relative performance gain over S1 (state-of-the-art budget forcing) on math reasoning tasks at low token budgets (512/1024 tokens).
Outperforms GPT-4o by 2 points on average across reasoning benchmarks when restricted to the same generation length, despite having only 1.5B parameters.
Achieves ~3% mean error in length adherence across math reasoning datasets, demonstrating high precision in following length constraints.

Breakthrough Assessment

9/10

Solves a critical, unaddressed problem in reasoning models (uncontrollable compute) with a simple, effective RL solution. The result—a 1.5B model beating GPT-4o at equal lengths—is highly significant.

⚙️ Technical Details

Problem Definition

Setting: Reasoning task where model generates answer y with reasoning trace of length n_y given input x and target length n_gold.

Inputs: Input prompt x, target length constraint n_gold

Outputs: Reasoning trace and final answer y satisfying length constraints

Pipeline Flow

Prompt Augmentation (append target length instruction)
Inference (generate reasoning trace + answer)
Evaluation (check correctness and length deviation)
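
The three pipeline steps above can be sketched as follows. This is a minimal illustration, not the released implementation: the instruction wording, the "Answer:" extraction rule, the whitespace token counter, and the stand-in generator are all assumptions made for the example.

```python
import re
from typing import Callable


def augment_prompt(question: str, target_tokens: int) -> str:
    """Step 1: append a target-length instruction (wording is an assumption)."""
    return f"{question} Think for {target_tokens} tokens."


def evaluate(output: str, gold_answer: str, target_tokens: int,
             count_tokens: Callable[[str], int]) -> dict:
    """Step 3: check correctness and the signed deviation from the target."""
    # Assumption: the final answer follows the last "Answer:" marker.
    matches = re.findall(r"Answer:\s*(.+)", output)
    predicted = matches[-1].strip() if matches else None
    n_used = count_tokens(output)
    return {
        "correct": predicted == gold_answer,
        "tokens_used": n_used,
        "deviation": n_used - target_tokens,  # negative means under budget
    }


def run_pipeline(question: str, gold_answer: str, target_tokens: int,
                 generate: Callable[[str], str],
                 count_tokens: Callable[[str], int]) -> dict:
    prompt = augment_prompt(question, target_tokens)  # step 1
    output = generate(prompt)                         # step 2
    return evaluate(output, gold_answer, target_tokens, count_tokens)


# Hypothetical stand-ins so the sketch runs without a model or tokenizer.
def fake_generate(prompt: str) -> str:
    return "2 plus 2 is 4. Answer: 4"


def whitespace_count(text: str) -> int:
    return len(text.split())


result = run_pipeline("What is 2 + 2?", "4", 8, fake_generate, whitespace_count)
print(result)
```

In practice, `generate` would call the trained model and `count_tokens` would use the model's own tokenizer.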

System Modules

Prompt Augmenter

Appends the target length instruction to the user query

Model or implementation: Deterministic rule

Reasoning Generator

Generates the reasoning trace and final answer attempting to meet length constraints

Model or implementation: L1 (based on DeepScaleR-1.5B-Preview)

Novel Architectural Elements

Conditioning mechanism: The prompt explicitly includes a numeric target length which the model is trained via RL to treat as a hard or soft constraint.

Modeling

Base Model: DeepScaleR-1.5B-Preview (based on Qwen-2.5-1.5B)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize correctness while minimizing deviation from exact target length.

Formally: R(y, n_gold) = I(y is correct) - alpha * |n_y - n_gold|

Purpose: Maximize correctness while respecting a maximum length cap (LCPO-Max).

Formally: R(y, n_gold) = I(y is correct) * clip(alpha * (n_gold - n_y) + delta, 0, 1), where the offset delta keeps correct answers with minor budget violations preferred over incorrect answers.
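
As a worked illustration of the first (exact-length) objective, the sketch below evaluates R = I(y is correct) - alpha * |n_y - n_gold| with the alpha value listed under Key Hyperparameters. The function name and the example token counts are hypothetical.

```python
def lcpo_exact_reward(is_correct: bool, n_used: int, n_target: int,
                      alpha: float = 0.0003) -> float:
    """Correctness indicator minus a linear penalty on length deviation."""
    return float(is_correct) - alpha * abs(n_used - n_target)


# Hypothetical cases for a 512-token target.
on_target = lcpo_exact_reward(True, 512, 512)         # correct, no penalty
overshoot = lcpo_exact_reward(True, 1512, 512)        # correct, 1000 tokens over
wrong_on_target = lcpo_exact_reward(False, 512, 512)  # incorrect, no penalty

print(on_target, overshoot, wrong_on_target)
```

With alpha = 0.0003, each 1000 tokens of deviation costs about 0.3 reward, so a correct answer only falls below an incorrect on-target answer once it misses the target by more than roughly 3,333 tokens (1 / alpha).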

Adaptation: Full fine-tuning

Trainable Parameters: 1.5B

Training Data:

DeepScaleR-Preview-Dataset (40K problems from AIME, AMC, Omni-Math, STILL)
Prompts augmented with target lengths sampled uniformly from U(100, 4000)
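
The prompt augmentation described above could look like the following sketch. It assumes integer targets drawn uniformly from the stated range and a hypothetical instruction wording; neither the helper names nor the record layout come from the paper.

```python
import random

N_MIN, N_MAX = 100, 4000  # target-length range stated in the summary


def sample_target_length(rng: random.Random) -> int:
    """Draw an integer target length uniformly from [N_MIN, N_MAX]."""
    return rng.randint(N_MIN, N_MAX)  # randint includes both endpoints


def augment_dataset(problems: list[str], seed: int = 0) -> list[dict]:
    """Attach a freshly sampled target length to every training problem."""
    rng = random.Random(seed)  # seeded for reproducibility
    augmented = []
    for problem in problems:
        n_target = sample_target_length(rng)
        augmented.append({
            # Instruction wording is an assumption made for illustration.
            "prompt": f"{problem} Think for {n_target} tokens.",
            "target_tokens": n_target,
        })
    return augmented


examples = augment_dataset(["Solve x + 1 = 3.", "Factor x^2 - 1."])
for example in examples:
    print(example)
```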

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 128
context_length_training: 4K
context_length_eval: 8K
alpha: 0.0003
temperature: 0.6

Compute: Not reported in the paper

Comparison to Prior Work

vs. S1: L1 learns to plan/adapt reasoning to fit budget via RL, whereas S1 truncates/forces answers heuristically.
vs. DeepScaleR-1.5B: L1 adds a length-penalty term to the reward function and conditions on input length constraints.
vs. Standard Instruction Following [not cited in paper]: L1 optimizes for reasoning correctness trade-offs, not just verbosity reduction.

Limitations

Training context length restricted to 4K tokens due to compute constraints (vs 24K for original base model).
L1-Exact assigns equal token budgets to all problems regardless of difficulty.
Performance on knowledge-intensive tasks (MMLU) scales less effectively with length than reasoning tasks.

Reproducibility

Code: https://cmu-l3.github.io/l1

Code and models are released at https://cmu-l3.github.io/l1. The dataset (DeepScaleR-Preview-Dataset) is available. Training uses the VeRL framework, and hyperparameters are provided. Compute resources (GPU type/hours) are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Evaluate accuracy and length adherence across multiple math and reasoning benchmarks at varying target token budgets.

Benchmarks:

AIME 2025 (Math Competition)
MATH (Math Reasoning)
AMC (Math Competition)
Olympiad-Bench (Math Competition)
GPQA (General Reasoning, OOD)
LSAT (Logical Reasoning, OOD)
MMLU (General Knowledge, OOD)

Metrics:

Accuracy (Pass@1)
Mean Length Deviation
Budget Violation Rate

Statistical methodology: Not explicitly reported in the paper
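
The two length metrics listed above can be computed as in the sketch below. It assumes that mean length deviation is the average of |n_y - n_gold| / n_gold and that a budget violation means n_y > n_gold; the token counts are made-up values for illustration only.

```python
def _check(used: list[int], targets: list[int]) -> None:
    if not targets or len(used) != len(targets):
        raise ValueError("used and targets must be non-empty and equal length")


def mean_length_deviation(used: list[int], targets: list[int]) -> float:
    """Average relative gap between generated and requested lengths."""
    _check(used, targets)
    return sum(abs(u - t) / t for u, t in zip(used, targets)) / len(targets)


def budget_violation_rate(used: list[int], targets: list[int]) -> float:
    """Fraction of generations that exceed their requested budget."""
    _check(used, targets)
    return sum(u > t for u, t in zip(used, targets)) / len(targets)


# Hypothetical token counts, not results from the paper.
used = [500, 1050, 2000, 4100]
targets = [512, 1024, 2048, 4096]

deviation = mean_length_deviation(used, targets)
violations = budget_violation_rate(used, targets)
print(f"mean deviation: {deviation:.4f}, violation rate: {violations:.2f}")
```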

Key Results

Benchmark	Metric	Baseline	This Paper	Δ

Comparative analysis against S1 (state-of-the-art length control baseline), showing massive gains at lower token budgets:
Math Reasoning (Avg)	Accuracy (Relative Gain)	Not reported in the paper	Not reported in the paper	Not reported in the paper

Short Reasoning Models (SRMs) comparison, L1 vs. much larger models at equivalent short generation lengths:
Average (MATH, AMC, AIME, etc.)	Accuracy	48.3	50.3	+2.0
Average (MATH, AMC, AIME, etc.)	Accuracy	45.0	50.3	+5.3

Length controllability analysis:
Math Datasets (Avg)	Mean Length Error	Not reported in the paper	0.03	Not reported in the paper

Experiment Figures

Accuracy vs. Average Output Length trade-off curves for L1, S1, and baselines across AIME, MATH, AMC, and Olympiad-Bench.

Generalization to OOD datasets (GPQA, LSAT, MMLU).

Main Takeaways

L1 exhibits a 'log-linear' scaling law where performance improves linearly with the log of the token budget.
L1-Max is more efficient than L1-Exact: by not forcing long outputs on simple problems, it often matches the performance of unconstrained models with 2x fewer tokens.
The method generalizes to Out-of-Distribution (OOD) tasks like GPQA and LSAT without explicit training, maintaining the log-linear scaling trend.
Short Reasoning Models (SRMs) capability: L1 effectively distills reasoning patterns into short traces, outperforming both its non-reasoning base and GPT-4o at restricted lengths.
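
To make the log-linear claim above concrete, the sketch below fits accuracy = a + b * ln(budget) by ordinary least squares. The accuracy values are synthetic, chosen to rise by exactly 0.05 per doubling of the budget; they are not numbers from the paper.

```python
import math


def fit_log_linear(budgets: list[int], accuracies: list[float]) -> tuple[float, float]:
    """Least-squares fit of accuracy = a + b * ln(budget); returns (a, b)."""
    xs = [math.log(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data for illustration only.
budgets = [512, 1024, 2048, 4096]
accuracies = [0.30, 0.35, 0.40, 0.45]

a, b = fit_log_linear(budgets, accuracies)
gain_per_doubling = b * math.log(2)  # accuracy gained each time the budget doubles
print(f"gain per doubling: {gain_per_doubling:.3f}")
```

Under a log-linear trend, every doubling of the token budget buys the same absolute accuracy gain, which is why the returns to extra tokens diminish on a linear axis.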

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) for LLMs (specifically PPO or GRPO)
Chain-of-thought (CoT) prompting
Test-time compute scaling

Key Terms

LCPO: Length Controlled Policy Optimization—an RL method that updates a model to satisfy both correctness and specific output length constraints.

Chain-of-Thought (CoT): Intermediate reasoning steps generated by a model before producing a final answer.

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group averages of sampled outputs rather than a separate value network.

S1: A baseline method that controls length by forcing a 'Wait' or 'Final Answer' token when a budget is reached.

OOD: Out-of-Distribution—tasks or datasets not seen during the training phase.

Test-time compute: The amount of computational resources (tokens generated) used during inference to solve a problem.

DeepScaleR-1.5B-Preview: The base reasoning model used, itself obtained by RL fine-tuning DeepSeek-R1-Distill-Qwen-1.5B.
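
The GRPO entry above describes baselines estimated from group averages rather than a value network. A minimal sketch of that idea follows, assuming the common formulation in which each sampled output's reward is standardized against its own group; the use of the population standard deviation and the epsilon term are assumptions, and real implementations may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each reward against the mean and spread of its group."""
    baseline = mean(rewards)  # the group average serves as the baseline
    spread = pstdev(rewards)  # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four sampled outputs to the same prompt.
rewards = [1.0, 0.7, 0.0, -0.1]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that score above the group average receive positive advantages and are reinforced; if every output in a group earns the same reward, all advantages are zero and the group contributes no learning signal.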