โ† Back to Paper List

ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems

Noel Thomas
arXiv (2026)
Benchmark Reasoning Factuality QA

๐Ÿ“ Paper Summary

Scientific Reasoning Neuro-symbolic AI Dynamical Systems
ChaosBench-Logic evaluates LLM reasoning on chaotic systems using a formal first-order logic ontology, revealing that frontier models fail to maintain logical consistency despite high surface-level accuracy.
Core Problem
LLMs excel at natural language but are brittle in scientific domains requiring precise symbolic reasoning, often confusing deterministic chaos with randomness or complexity.
Why it matters:
  • Misinterpreting dynamical systems (e.g., conflating chaos with stochasticity) leads to incorrect scientific conclusions and undermines trust in AI-driven discovery
  • Existing benchmarks focus on forecasting or general math, missing the formal logical semantics required to distinguish qualitative regimes (e.g., chaotic vs. quasi-periodic)
  • Models often hallucinate intermediate steps or fail to maintain coherence across multi-turn reasoning, which is critical for scientific deduction
Concrete Example: A model might correctly identify a system as 'Sensitive to initial conditions' but then incorrectly label it 'Random' in a subsequent turn, violating the axiom that chaos is deterministic. Existing benchmarks miss this because they check single answers rather than logical consistency.
Key Novelty
ChaosBench-Logic: First-Order Logic Constraint Testing for Chaos Theory
  • Defines a unified logic ontology (11 predicates, e.g., Chaotic, PosLyap) and global axioms (e.g., Chaotic implies Deterministic) to ground reasoning in formal semantics
  • Evaluates models not just on answer accuracy, but on 'implication consistency': whether their answers across different questions respect the logical rules of dynamical systems
  • Includes bias probes specifically designed to trigger common misconceptions, such as the belief that all nonlinear systems are chaotic
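The implication-consistency idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual harness: the predicate names (Chaotic, Deterministic, PosLyap, Random) follow the summary, but the axiom encoding and the `consistency_violations` helper are assumptions for the sake of the example.

```python
# Hypothetical sketch of implication-consistency checking in the style of
# ChaosBench-Logic. Axioms are encoded as implications over a model's
# per-predicate True/False answers for a single system.
AXIOMS = [
    ("Chaotic", "Deterministic"),  # chaos is deterministic
    ("Chaotic", "PosLyap"),        # chaos implies a positive Lyapunov exponent
]

# Predicate pairs that cannot both hold for the same system.
EXCLUSIONS = [("Chaotic", "Random")]

def consistency_violations(answers: dict[str, bool]) -> list[str]:
    """Return the axioms/exclusions violated by a set of model answers."""
    violations = []
    for antecedent, consequent in AXIOMS:
        # An implication is violated only if the antecedent is affirmed
        # and the consequent is explicitly denied.
        if answers.get(antecedent) and answers.get(consequent) is False:
            violations.append(f"{antecedent} -> {consequent}")
    for p, q in EXCLUSIONS:
        if answers.get(p) and answers.get(q):
            violations.append(f"not({p} and {q})")
    return violations

# A model asserting both 'Chaotic' and 'Random' violates the exclusion
# axiom even though each answer might look plausible in isolation.
print(consistency_violations(
    {"Chaotic": True, "Random": True, "Deterministic": True}
))
# prints ['not(Chaotic and Random)']
```

Scoring answers jointly against such axioms, rather than one question at a time, is what lets this style of benchmark catch the chaos-vs-randomness confusions that single-answer accuracy misses.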
Evaluation Highlights
  • Frontier models (GPT-4, Claude 3.5 Sonnet) achieve high local accuracy (91–94%) on atomic questions but drop to 0% accuracy on compositional reasoning items
  • Dialogue coherence is fragile: although single-turn accuracy is high, consistent reasoning across 3-6 turn dialogues falls to between 53.1% (GPT-4 with CoT) and 75.5% (LLaMA-3 zero-shot)
  • Models frequently violate domain axioms, such as asserting a system is both 'Chaotic' and 'Random', exposing deep semantic confusion despite correct surface-level answers
Breakthrough Assessment
8/10
Strong contribution to scientific reasoning evaluation. By enforcing logical consistency over mere fact-retrieval, it exposes critical fragility in how LLMs handle formal scientific definitions.