CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang
University of Science and Technology of China, ByteDance SEED, Shenzhen University of Advanced Technology
arXiv (2026)
Tags: Reasoning · Benchmark · Factuality

📝 Paper Summary

Evaluating Chain-of-Thought (CoT) Efficiency in Reasoning Models
CoTJudger evaluates reasoning efficiency by converting linear Chain-of-Thought text into directed dependency graphs to identify the Shortest Effective Path (SEP) and quantify structural redundancy.
Core Problem
Large Reasoning Models (LRMs) often engage in 'over-reasoning' (circular verification, redundant steps), but existing evaluations rely on coarse token counts that cannot distinguish necessary complexity from structural waste.
Why it matters:
  • Extended reasoning substantially increases inference compute costs without reliably improving outcomes
  • Current metrics risk optimizing for token volume rather than reasoning quality
  • Distilled models often mimic the verbosity of larger teachers without the associated reasoning rigor, creating a 'reasoning illusion'
Concrete Example: DeepSeek-R1-0528-Qwen3-8B averages 8,817 tokens per query, yet the core reasoning (Shortest Effective Path) requires only 7–47 steps. The model spends >80% of compute on loops and self-correction that could be pruned.
Key Novelty
Graph-Driven Reasoning Topology Analysis
  • Maps free-form text CoTs into directed dependency graphs where nodes are atomic steps and edges represent logic (verification, backtracking, advancement)
  • Algorithmically extracts the Shortest Effective Path (SEP)—the minimal subgraph needed to derive the correct answer—to calculate a Redundancy Ratio
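The graph construction and SEP extraction described above can be sketched as follows. This is a minimal illustration using `networkx`; the node names, edge types, and the use of a plain shortest path to stand in for the SEP are assumptions for the example, not the paper's actual extraction algorithm:

```python
import networkx as nx

# Toy dependency graph for a CoT trace: nodes are atomic reasoning
# steps, directed edges carry a logical relation (advancement,
# verification, backtracking). "q" is the question, "ans" the answer.
G = nx.DiGraph()
G.add_edges_from([
    ("q", "s1", {"type": "advance"}),
    ("s1", "s2", {"type": "advance"}),
    ("s2", "s3", {"type": "verify"}),     # redundant self-check
    ("s3", "s1", {"type": "backtrack"}),  # loop back to an earlier step
    ("s2", "ans", {"type": "advance"}),
])

# One simple way to operationalize the Shortest Effective Path (SEP):
# the minimal chain of steps from the question to the answer.
sep = nx.shortest_path(G, "q", "ans")

# Redundancy Ratio: fraction of steps that lie outside the SEP.
redundancy = 1 - len(sep) / G.number_of_nodes()
print(sep)                    # ['q', 's1', 's2', 'ans']
print(round(redundancy, 2))   # 0.2
```

Here the verify/backtrack loop through `s3` contributes nothing to deriving the answer, so one of five steps (20%) is structurally redundant.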
Evaluation Highlights
  • Qwen3-Max exhibits a Redundancy Ratio of 86.5%, spending the vast majority of its inference budget on non-essential steps
  • DeepSeek-R1 (teacher) shows high cyclic complexity with an Average Degree of ~1.75 and Redundancy Ratio of 78.0%
  • Distilled LRMs consistently exceed 69% Redundancy Ratio, often inheriting structural bloat from teachers without their verification capability
Breakthrough Assessment
8/10
Provides the first automated, structure-aware framework to disentangle reasoning length from reasoning utility, exposing the 'illusion of depth' in distilled reasoning models.