
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought

Qiguang Chen, Libo Qin, Jiaqi Wang, J. Zhou, Wanxiang Che
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, School of Computer Science and Engineering, Central South University, The Chinese University of Hong Kong
Neural Information Processing Systems (2024)
Reasoning Benchmark

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning; Model Evaluation and Benchmarking
The paper introduces a Reasoning Boundary Framework (RBF) to quantify the upper limits of Chain-of-Thought capabilities and optimize performance by aligning prompting strategies with problem difficulty.
Core Problem
Current research lacks quantitative metrics to assess the upper-bound capabilities of Chain-of-Thought (CoT) and provides little guidance on how to optimize CoT strategies based on these limitations.
Why it matters:
  • Existing studies offer only qualitative assessments (e.g., CoT is limited by demonstration logic), hindering objective comparison of CoT approaches.
  • Without knowing where a model's reasoning boundary lies, researchers cannot effectively design actionable optimization strategies to push those boundaries.
  • Understanding performance drop-offs is critical for deploying LLMs in complex reasoning tasks where reliability is paramount.
Concrete Example: In arithmetic multiplication, a model might reach >90% accuracy when the product is at most 2.2e5 but fall to <10% when the product exceeds 2e6. Standard evaluation metrics that average performance across all difficulty levels obscure this sharp 'cliff' in capability.
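The cliff can be illustrated with a minimal sketch. The bucket data below is made up for illustration and is not taken from the paper, and `reasoning_boundary` and `mean_accuracy` are hypothetical helper names. The sketch also assumes that accuracy is measured per difficulty bucket and that the boundary is the largest difficulty reached before accuracy first drops below the threshold.

```python
from typing import Optional, Sequence, Tuple

# Illustrative (made-up) data: (largest product in the bucket, accuracy).
# These numbers are NOT results from the paper; they only mimic the cliff.
BUCKETS = [
    (1e4, 0.98),
    (2.2e5, 0.92),
    (1e6, 0.45),
    (2e6, 0.12),
    (1e7, 0.04),
]


def reasoning_boundary(
    buckets: Sequence[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Largest difficulty reached before accuracy first falls below threshold.

    Returns None if even the easiest bucket misses the threshold.
    """
    boundary = None
    for difficulty, accuracy in sorted(buckets):
        if accuracy < threshold:
            break
        boundary = difficulty
    return boundary


def mean_accuracy(buckets: Sequence[Tuple[float, float]]) -> float:
    """Unweighted average accuracy over all buckets."""
    return sum(accuracy for _, accuracy in buckets) / len(buckets)


if __name__ == "__main__":
    print("mean accuracy:", mean_accuracy(BUCKETS))
    print("boundary at 90%:", reasoning_boundary(BUCKETS, 0.9))
    print("boundary at 10%:", reasoning_boundary(BUCKETS, 0.1))
```

On this toy data the average accuracy is about 50%, which says little about where the model is reliable, whereas the two boundaries separate the range where it almost always succeeds from the range where it almost always fails.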
Key Novelty
Reasoning Boundary Framework (RBF) & Combination Law
  • Defines 'Reasoning Boundary' (RB) as the maximum problem difficulty where model accuracy meets a specific threshold (e.g., 90% or 10%).
  • Proposes a 'Combination Law' modeling complex task performance as the weighted harmonic mean of fundamental sub-capabilities (e.g., planning and calculation).
  • Categorizes problem space into Completely Feasible (CFRB), Partially Feasible (PFRB), and Completely Infeasible (CIRB) to guide distinct optimization strategies for each zone.
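As a rough illustration of the Combination Law, the sketch below computes a generic weighted harmonic mean of two sub-capability boundaries. This summary does not give the paper's exact formula, so the weights, the normalization, and the example boundary values are assumptions, and `weighted_harmonic_mean` is a hypothetical helper name.

```python
from typing import Sequence


def weighted_harmonic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Generic weighted harmonic mean: sum(w) / sum(w / v).

    This is the textbook definition, not necessarily the paper's exact
    parameterization of the Combination Law.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(v <= 0 for v in values) or any(w <= 0 for w in weights):
        raise ValueError("values and weights must be positive")
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


if __name__ == "__main__":
    # Hypothetical boundaries on a shared difficulty scale (assumed values).
    planning_rb = 6.0
    calculation_rb = 3.0
    combined = weighted_harmonic_mean([planning_rb, calculation_rb], [1.0, 1.0])
    print("combined boundary:", combined)
```

With equal weights the combined value (about 4.0) falls below the arithmetic mean of 4.5, reflecting how a harmonic mean is pulled toward the weaker sub-capability.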
Architecture
Figure 1: Concept of Reasoning Boundary (RB), the Combination Law, and the three RB Categories (CFRB, PFRB, CIRB).
Evaluation Highlights
  • Minimum Acceptable Reasoning Path (MARP) prompting achieves state-of-the-art results on GSM8K and BigGSM compared to 10 other CoT strategies.
  • Validates the Combination Law across 27 models and 5 tasks, showing complex math reasoning boundaries align with the harmonic mean of planning and calculation capabilities.
  • Identifies three distinct performance zones: >90% accuracy (CFRB), <10% accuracy (CIRB), and a transition zone (PFRB) requiring consensus-building strategies.
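The three zones listed above can be expressed as a simple threshold rule. This is a minimal sketch assuming that a measured accuracy at a given difficulty is available; `classify_zone` is a hypothetical helper name, and treating exactly 90% or exactly 10% as part of the transition zone is an assumption made here.

```python
def classify_zone(accuracy: float, upper: float = 0.9, lower: float = 0.1) -> str:
    """Map an accuracy in [0, 1] to one of the three reasoning-boundary zones.

    > upper  -> "CFRB" (completely feasible)
    < lower  -> "CIRB" (completely infeasible)
    otherwise -> "PFRB" (partially feasible, the transition zone)
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if accuracy > upper:
        return "CFRB"
    if accuracy < lower:
        return "CIRB"
    return "PFRB"


if __name__ == "__main__":
    for acc in (0.95, 0.50, 0.05):
        print(acc, classify_zone(acc))
```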
Breakthrough Assessment
7/10
Provides a novel quantitative framework for understanding CoT limits and a verified 'law' for composing capabilities. The resulting optimization strategy (MARP) is effective, though the core contribution is the theoretical framework for quantification.