โ† Back to Paper List

ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems

Noel Thomas
arXiv (2026)
Benchmark Reasoning Factuality QA

๐Ÿ“ Paper Summary

Scientific Reasoning Neuro-symbolic AI Dynamical Systems
ChaosBench-Logic evaluates LLM reasoning on chaotic systems using a formal first-order logic ontology, revealing that frontier models fail to maintain logical consistency despite high surface-level accuracy.
Core Problem
LLMs excel at natural language but are brittle in scientific domains requiring precise symbolic reasoning, often confusing deterministic chaos with randomness or complexity.
Why it matters:
  • Misinterpreting dynamical systems (e.g., conflating chaos with stochasticity) leads to incorrect scientific conclusions and undermines trust in AI-driven discovery
  • Existing benchmarks focus on forecasting or general math, missing the formal logical semantics required to distinguish qualitative regimes (e.g., chaotic vs. quasi-periodic)
  • Models often hallucinate intermediate steps or fail to maintain coherence across multi-turn reasoning, which is critical for scientific deduction
Concrete Example: A model might correctly identify a system as 'Sensitive to initial conditions' but then incorrectly label it 'Random' in a subsequent turn, violating the axiom that chaos is deterministic. Existing benchmarks miss this because they check single answers rather than logical consistency.
Key Novelty
ChaosBench-Logic: First-Order Logic Constraint Testing for Chaos Theory
  • Defines a unified logic ontology (11 predicates, e.g., Chaotic, PosLyap) and global axioms (e.g., Chaotic implies Deterministic) to ground reasoning in formal semantics
  • Evaluates models not just on answer accuracy, but on 'implication consistency': whether their answers across different questions respect the logical rules of dynamical systems
  • Includes bias probes specifically designed to trigger common misconceptions, such as the belief that all nonlinear systems are chaotic
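The implication-consistency idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual harness: the predicate names (Chaotic, Deterministic, PosLyap, Random) follow the summary, but the axiom encoding and the `consistency_violations` helper are assumptions for the sake of the example.

```python
# Hypothetical sketch of implication-consistency checking in the style of
# ChaosBench-Logic. Axioms are encoded as implications over a model's
# per-predicate True/False answers for a single system.
AXIOMS = [
    ("Chaotic", "Deterministic"),  # chaos is deterministic
    ("Chaotic", "PosLyap"),        # chaos implies a positive Lyapunov exponent
]

# Predicate pairs that cannot both hold for the same system.
EXCLUSIONS = [("Chaotic", "Random")]

def consistency_violations(answers: dict[str, bool]) -> list[str]:
    """Return the axioms/exclusions violated by a set of model answers."""
    violations = []
    for antecedent, consequent in AXIOMS:
        # An implication is violated only if the antecedent is affirmed
        # and the consequent is explicitly denied.
        if answers.get(antecedent) and answers.get(consequent) is False:
            violations.append(f"{antecedent} -> {consequent}")
    for p, q in EXCLUSIONS:
        if answers.get(p) and answers.get(q):
            violations.append(f"not({p} and {q})")
    return violations

# A model asserting both 'Chaotic' and 'Random' violates the exclusion
# axiom even though each answer might look plausible in isolation.
print(consistency_violations(
    {"Chaotic": True, "Random": True, "Deterministic": True}
))
# prints ['not(Chaotic and Random)']
```

Scoring answers jointly against such axioms, rather than one question at a time, is what lets this style of benchmark catch the chaos-vs-randomness confusions that single-answer accuracy misses.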
Evaluation Highlights
  • Frontier models (GPT-4, Claude 3.5 Sonnet) achieve high local accuracy (91–94%) on atomic questions but drop to 0% accuracy on compositional reasoning items
  • Dialogue coherence is fragile: although single-turn accuracy is high, consistent reasoning across 3-6 turn dialogues falls to between 53.1% (GPT-4 with CoT) and 75.5% (LLaMA-3 zero-shot)
  • Models frequently violate domain axioms, such as asserting a system is both 'Chaotic' and 'Random', exposing deep semantic confusion despite correct surface-level answers
Breakthrough Assessment
8/10
Strong contribution to scientific reasoning evaluation. By enforcing logical consistency over mere fact-retrieval, it exposes critical fragility in how LLMs handle formal scientific definitions.