CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang
University of Science and Technology of China, ByteDance SEED, Shenzhen University of Advanced Technology
arXiv (2026)
Tags: Reasoning · Benchmark · Factuality

📝 Paper Summary

Evaluating Chain-of-Thought (CoT) Efficiency in Reasoning Models
CoTJudger evaluates reasoning efficiency by converting linear Chain-of-Thought text into directed dependency graphs to identify the Shortest Effective Path (SEP) and quantify structural redundancy.
Core Problem
Large Reasoning Models (LRMs) often engage in 'over-reasoning' (circular verification, redundant steps), but existing evaluations rely on coarse token counts that cannot distinguish necessary complexity from structural waste.
Why it matters:
  • Extended reasoning substantially increases inference compute costs without reliably improving outcomes
  • Current metrics risk optimizing for token volume rather than reasoning quality
  • Distilled models often mimic the verbosity of larger teachers without the associated reasoning rigor, creating a 'reasoning illusion'
Concrete Example: DeepSeek-R1-0528-Qwen3-8B averages 8,817 tokens per query, yet the core reasoning (Shortest Effective Path) requires only 7–47 steps. The model spends >80% of compute on loops and self-correction that could be pruned.
Key Novelty
Graph-Driven Reasoning Topology Analysis
  • Maps free-form text CoTs into directed dependency graphs where nodes are atomic steps and edges represent logic (verification, backtracking, advancement)
  • Algorithmically extracts the Shortest Effective Path (SEP)—the minimal subgraph needed to derive the correct answer—to calculate a Redundancy Ratio
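The graph construction and SEP extraction described above can be sketched as follows. This is a minimal illustration using `networkx`; the node names, edge types, and the use of a plain shortest path to stand in for the SEP are assumptions for the example, not the paper's actual extraction algorithm:

```python
import networkx as nx

# Toy dependency graph for a CoT trace: nodes are atomic reasoning
# steps, directed edges carry a logical relation (advancement,
# verification, backtracking). "q" is the question, "ans" the answer.
G = nx.DiGraph()
G.add_edges_from([
    ("q", "s1", {"type": "advance"}),
    ("s1", "s2", {"type": "advance"}),
    ("s2", "s3", {"type": "verify"}),     # redundant self-check
    ("s3", "s1", {"type": "backtrack"}),  # loop back to an earlier step
    ("s2", "ans", {"type": "advance"}),
])

# One simple way to operationalize the Shortest Effective Path (SEP):
# the minimal chain of steps from the question to the answer.
sep = nx.shortest_path(G, "q", "ans")

# Redundancy Ratio: fraction of steps that lie outside the SEP.
redundancy = 1 - len(sep) / G.number_of_nodes()
print(sep)                    # ['q', 's1', 's2', 'ans']
print(round(redundancy, 2))   # 0.2
```

Here the verify/backtrack loop through `s3` contributes nothing to deriving the answer, so one of five steps (20%) is structurally redundant.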
Evaluation Highlights
  • Qwen3-Max exhibits a Redundancy Ratio of 86.5%, spending the vast majority of its inference budget on non-essential steps
  • DeepSeek-R1 (teacher) shows high cyclic complexity with an Average Degree of ~1.75 and Redundancy Ratio of 78.0%
  • Distilled LRMs consistently exceed 69% Redundancy Ratio, often inheriting structural bloat from teachers without their verification capability
Breakthrough Assessment
8/10
Provides the first automated, structure-aware framework to disentangle reasoning length from reasoning utility, exposing the 'illusion of depth' in distilled reasoning models.