
TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O'Connor, Fergal Reid
arXiv (2026)
Tags: Benchmark · Reasoning · Agent · MM

📝 Paper Summary

Topics: Spatial Reasoning · Reasoning Benchmarks
TopoBench reveals that LLMs fail at topological puzzles primarily due to spatial constraint extraction rather than reasoning logic, a bottleneck significantly mitigated by external structured tools.
Core Problem
LLMs struggle to maintain global spatial invariants (connectivity, loop closure, symmetry) across multi-step reasoning chains because they cannot reliably parse 2D spatial structures from linear token streams.
Why it matters:
  • Global spatial understanding is critical for real-world tasks like circuit layout, route planning, and molecular structure analysis where one violation invalidates the solution
  • Current benchmarks focus on local pattern matching or arithmetic, failing to test the ability to maintain consistency under sequential state updates
  • Existing evaluations rarely disentangle whether failures stem from reasoning logic deficits or representation/parsing limitations
Concrete Example: In a 'Bridges' puzzle, a model might correctly connect two islands but fail to realize this action isolates a third island, violating the global network connectivity constraint required for the solution.
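The connectivity invariant in this example can be sketched as a simple global check: after placing a bridge, verify that every island remains reachable via bridges that are already placed or could still be placed. This is an illustrative encoding, not TopoBench's actual puzzle representation:

```python
from collections import defaultdict

def all_islands_connectable(islands, bridges, capacity):
    """Global invariant check for a Bridges-style puzzle (illustrative).

    islands:  list of island ids
    bridges:  set of placed (a, b) pairs
    capacity: dict mapping candidate (a, b) pairs to remaining allowed bridges
    """
    # Build adjacency from placed bridges plus edges with spare capacity.
    adj = defaultdict(set)
    for a, b in bridges:
        adj[a].add(b)
        adj[b].add(a)
    for (a, b), cap in capacity.items():
        if cap > 0:
            adj[a].add(b)
            adj[b].add(a)

    # BFS from an arbitrary island; every island must be reached.
    if not islands:
        return True
    seen = {islands[0]}
    queue = [islands[0]]
    while queue:
        node = queue.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(islands)
```

A model that only checks the two islands it just connected skips exactly this whole-graph traversal, which is how a third island ends up silently isolated.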
Key Novelty
Diagnostic Benchmarking for Topological Reasoning
  • Introduces TopoBench, a suite of six puzzle families (e.g., Bridges, Loopy) testing specific invariants like connectivity and symmetry across three difficulty tiers
  • Implements a causal diagnostic pipeline that injects specific error types (e.g., constraint violations) into gold solution paths to measure each error type's causal impact on accuracy
  • Demonstrates that offloading spatial state tracking to an external tool engine recovers significant performance, isolating the bottleneck to perception rather than logic
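The causal-intervention idea from the second bullet can be sketched as follows: perturb a gold solution trace with one specific error type, then measure the resulting accuracy drop. Function names, trace format, and error types here are illustrative assumptions, not the paper's actual pipeline:

```python
import random

def inject_error(gold_trace, error_type, rng):
    """Return a perturbed copy of a gold solution trace.

    Each step is an (action, target) tuple; error types are illustrative.
    """
    trace = list(gold_trace)
    idx = rng.randrange(len(trace))
    if error_type == "premature_commitment":
        # Commit a step earlier than the gold ordering allows.
        step = trace.pop(idx)
        trace.insert(0, step)
    elif error_type == "constraint_violation":
        # Redirect a step to a wrong target, breaking a constraint.
        action, target = trace[idx]
        trace[idx] = (action, target + "_wrong")
    return trace

def accuracy_drop(score_trace, traces, error_type, seed=0):
    """Mean accuracy on clean traces minus mean accuracy on perturbed ones."""
    rng = random.Random(seed)
    clean = sum(score_trace(t) for t in traces) / len(traces)
    perturbed = sum(
        score_trace(inject_error(t, error_type, rng)) for t in traces
    ) / len(traces)
    return clean - perturbed
```

Comparing the drop across error types is what lets the benchmark attribute failures to a specific cause (e.g., premature commitment) rather than lumping them into one error taxonomy.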
Evaluation Highlights
  • Frontier models struggle: GPT-5-mini-high achieves only 0.24 accuracy on the hard tier, while DeepSeek V3.2 reaches just 0.10
  • Causal interventions reveal 'Premature Commitment' causes a ~20.8 percentage point accuracy drop on Bridges, significantly more than other error types
  • Tool-augmented reasoning (providing structured constraints) improves accuracy by 10% on Hard Bridges compared to the no-tool baseline
Breakthrough Assessment
8/10
Strong contribution in diagnosing *why* LLMs fail at reasoning. The causal intervention methodology is a significant advance over standard error taxonomy tagging.