
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Wei Liu, Ruochen Zhou, Yi-Xuan Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
The Hong Kong University of Science and Technology, City University of Hong Kong, University of Waterloo
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Efficient Reasoning · Chain-of-Thought Compression · Reinforcement Learning for LLMs
Laser-D improves reasoning efficiency by using reinforcement learning with a dynamic, difficulty-aware reward function that penalizes unnecessary verbosity while allocating more tokens to harder problems.
Core Problem
Large Reasoning Models (LRMs) suffer from 'over-thinking,' generating unnecessarily long and redundant chains of thought that increase compute costs and latency without always improving accuracy.
Why it matters:
  • LRMs like DeepSeek-R1 can output thousands of tokens for simple math problems, wasting significant computational resources
  • Existing efficiency methods (like hard truncation or static length penalties) typically degrade reasoning accuracy significantly, failing to balance performance and cost
  • Current reward shaping approaches are static and do not adapt to the evolving difficulty of questions or the model's changing capabilities during training
Concrete Example: For a trivial question like '1+1=?', a standard LRM might generate repetitive self-reflections ('Let me double-check...', 'Wait, is there a trick?') totaling hundreds of tokens. Laser-D trains the model to output the direct answer immediately, reserving long reasoning chains for complex Olympiad-level math problems.
Key Novelty
Dynamic Difficulty-Aware Length-Based Step Reward (Laser-D)
  • Replaces continuous length penalties with a 'step function' reward: models get a bonus for being correct AND under a target length, rather than just being pushed to be as short as possible
  • Introduces a difficulty-aware mechanism that automatically assigns larger token budgets to harder questions and tighter budgets to easier ones
  • Dynamically updates these target length budgets during training by monitoring the model's success rate, ensuring constraints evolve as the model gets smarter
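The two mechanisms above can be sketched together: a step-function reward that grants a bonus for correct answers within budget, and a controller that sizes budgets by observed success rate. This is a minimal illustration under assumed details, not the authors' implementation; the bonus value, the budget tiers, and the per-question bucketing are all assumptions for clarity.

```python
from collections import defaultdict

def laser_d_reward(correct: bool, length: int, budget: int) -> float:
    """Step-function reward: a correct answer within the target length
    budget earns a bonus; correctness alone still earns the base reward,
    so the model is never pushed to trade accuracy for brevity."""
    if not correct:
        return 0.0
    return 1.0 + (0.5 if length <= budget else 0.0)  # 0.5 bonus is illustrative

class BudgetController:
    """Difficulty-aware budgets: harder questions (lower observed success
    rate) get larger token budgets; as the model improves during training,
    its rising success rate automatically tightens the constraint."""
    def __init__(self, budgets=(512, 1024, 2048, 4096)):  # easy -> hard tiers (assumed)
        self.budgets = budgets
        self.stats = defaultdict(lambda: [0, 0])          # qid -> [n_correct, n_seen]

    def record(self, qid, correct: bool):
        self.stats[qid][0] += int(correct)
        self.stats[qid][1] += 1

    def budget_for(self, qid) -> int:
        n_correct, n_seen = self.stats[qid]
        if n_seen == 0:
            return self.budgets[-1]                       # unseen question: be generous
        rate = n_correct / n_seen
        # High success rate -> easy -> tight budget; low rate -> large budget.
        idx = min(int((1.0 - rate) * len(self.budgets)), len(self.budgets) - 1)
        return self.budgets[idx]
```

Because the reward is a step rather than a continuous penalty, responses already under budget gain nothing from further compression, which matches the paper's goal of cutting redundancy without squeezing out useful reasoning.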
Evaluation Highlights
  • +6.1 percentage points accuracy improvement on AIME 2024 for DeepSeek-R1-Distill-Qwen-1.5B compared to the original model
  • Reduces token usage by 63% on AIME 2024 (from ~15,900 to ~5,800 tokens) while simultaneously improving accuracy
  • Achieves the best Pareto-optimal trade-off between accuracy and length across MATH500, AIME, and AMC benchmarks compared to truncation, group-based, and budget-based baselines
Breakthrough Assessment
8/10
Offers a practical, highly effective solution to the 'over-thinking' problem in reasoning models. The simultaneous improvement in accuracy and massive reduction in tokens (efficiency) is a significant result.