
TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O'Connor, Fergal Reid
arXiv (2026)
Tags: Benchmark · Reasoning · Agent · MM

📝 Paper Summary

Topics: Spatial Reasoning · Reasoning Benchmarks
TopoBench reveals that LLMs fail at topological puzzles primarily due to spatial constraint extraction rather than reasoning logic, a bottleneck significantly mitigated by external structured tools.
Core Problem
LLMs struggle to maintain global spatial invariants (connectivity, loop closure, symmetry) across multi-step reasoning chains because they cannot reliably parse 2D spatial structures from linear token streams.
Why it matters:
  • Global spatial understanding is critical for real-world tasks like circuit layout, route planning, and molecular structure analysis where one violation invalidates the solution
  • Current benchmarks focus on local pattern matching or arithmetic, failing to test the ability to maintain consistency under sequential state updates
  • Existing evaluations rarely disentangle whether failures stem from reasoning logic deficits or representation/parsing limitations
Concrete Example: In a 'Bridges' puzzle, a model might correctly connect two islands but fail to realize this action isolates a third island, violating the global network connectivity constraint required for the solution.
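The connectivity invariant in this example can be sketched as a simple global check: after placing a bridge, verify that every island remains reachable via bridges that are already placed or could still be placed. This is an illustrative encoding, not TopoBench's actual puzzle representation:

```python
from collections import defaultdict

def all_islands_connectable(islands, bridges, capacity):
    """Global invariant check for a Bridges-style puzzle (illustrative).

    islands:  list of island ids
    bridges:  set of placed (a, b) pairs
    capacity: dict mapping candidate (a, b) pairs to remaining allowed bridges
    """
    # Build adjacency from placed bridges plus edges with spare capacity.
    adj = defaultdict(set)
    for a, b in bridges:
        adj[a].add(b)
        adj[b].add(a)
    for (a, b), cap in capacity.items():
        if cap > 0:
            adj[a].add(b)
            adj[b].add(a)

    # BFS from an arbitrary island; every island must be reached.
    if not islands:
        return True
    seen = {islands[0]}
    queue = [islands[0]]
    while queue:
        node = queue.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(islands)
```

A model that only checks the two islands it just connected skips exactly this whole-graph traversal, which is how a third island ends up silently isolated.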
Key Novelty
Diagnostic Benchmarking for Topological Reasoning
  • Introduces TopoBench, a suite of six puzzle families (e.g., Bridges, Loopy) testing specific invariants like connectivity and symmetry across three difficulty tiers
  • Implements a causal diagnostic pipeline that injects specific error types (e.g., constraint violations) into gold solution paths to measure each error type's causal impact on accuracy
  • Demonstrates that offloading spatial state tracking to an external tool engine recovers significant performance, isolating the bottleneck to perception rather than logic
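The causal-intervention idea from the second bullet can be sketched as follows: perturb a gold solution trace with one specific error type, then measure the resulting accuracy drop. Function names, trace format, and error types here are illustrative assumptions, not the paper's actual pipeline:

```python
import random

def inject_error(gold_trace, error_type, rng):
    """Return a perturbed copy of a gold solution trace.

    Each step is an (action, target) tuple; error types are illustrative.
    """
    trace = list(gold_trace)
    idx = rng.randrange(len(trace))
    if error_type == "premature_commitment":
        # Commit a step earlier than the gold ordering allows.
        step = trace.pop(idx)
        trace.insert(0, step)
    elif error_type == "constraint_violation":
        # Redirect a step to a wrong target, breaking a constraint.
        action, target = trace[idx]
        trace[idx] = (action, target + "_wrong")
    return trace

def accuracy_drop(score_trace, traces, error_type, seed=0):
    """Mean accuracy on clean traces minus mean accuracy on perturbed ones."""
    rng = random.Random(seed)
    clean = sum(score_trace(t) for t in traces) / len(traces)
    perturbed = sum(
        score_trace(inject_error(t, error_type, rng)) for t in traces
    ) / len(traces)
    return clean - perturbed
```

Comparing the drop across error types is what lets the benchmark attribute failures to a specific cause (e.g., premature commitment) rather than lumping them into one error taxonomy.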
Evaluation Highlights
  • Frontier models struggle: GPT-5-mini-high achieves only 0.24 accuracy on the hard tier, while DeepSeek V3.2 reaches just 0.10
  • Causal interventions reveal 'Premature Commitment' causes a ~20.8 percentage point accuracy drop on Bridges, significantly more than other error types
  • Tool-augmented reasoning (providing structured constraints) improves accuracy by 10% on Hard Bridges compared to the no-tool baseline
Breakthrough Assessment
8/10
Strong contribution in diagnosing *why* LLMs fail at reasoning. The causal intervention methodology is a significant advance over standard error taxonomy tagging.