Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
Microsoft Research Asia
arXiv.org (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

LLM Reasoning · Reinforcement Learning (RL) · Synthetic Data
Logic-RL teaches a 7B language model advanced reasoning skills by training on verifiable synthetic logic puzzles using reinforcement learning, achieving strong generalization to math benchmarks.
Core Problem
Reproducing the emergent reasoning capabilities of models like DeepSeek-R1 is difficult because the training code and datasets are not public, and existing math datasets have uncontrolled complexity that complicates analysis.
Why it matters:
  • Current math datasets (e.g., GSM8K) have high variance in logical depth, making them poor controlled testbeds for studying reasoning dynamics
  • The lack of reproducible frameworks for 'R1-like' reasoning leaves the research community unsure how to replicate emergent behaviors like self-reflection in smaller models
  • Understanding whether reasoning is genuine abstract problem-solving or just superficial pattern matching requires strictly controllable training data
Concrete Example: When trained via standard Supervised Fine-Tuning (SFT), models often memorize surface patterns; if a logic puzzle's variables are simply renamed or reordered, an SFT model's accuracy drops significantly, whereas an RL-trained model maintains performance.
Key Novelty
Logic-RL Framework
  • Uses 'Knights and Knaves' synthetic logic puzzles as a training testbed because they have controllable difficulty and unique, deterministically verifiable ground truth answers
  • Implements a 'Format Reward' that strictly enforces a separation between thinking (<think>) and answering (<answer>) tags to prevent the model from taking shortcuts or guessing
  • Demonstrates that reasoning skills learned in a pure logic domain transfer to completely different domains like mathematics (AIME/AMC)
Evaluation Highlights
  • +125% improvement on AIME (2021-2024) benchmark using Qwen2.5-7B-Instruct trained on only 5k logic puzzles
  • +38% improvement on AMC (2022-2023) benchmark compared to the base model
  • Output length grows roughly 4x (from ~500 to ~2,000 tokens) over 1k training steps, correlating with the emergence of self-reflection and verification behaviors
Breakthrough Assessment
8/10
Provides a reproducible recipe for 'R1-like' reasoning in small models using synthetic data. The cross-domain generalization from logic puzzles to advanced math is a significant and surprising finding.