Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design

📝 Paper Summary

Neurosymbolic AI, AI for Mathematics

An LLM agent equipped with symbolic tools and persistent memory collaborates with a human researcher to discover a new tight lower bound on Latin square imbalance by identifying data patterns symbolic methods missed.

Core Problem

Current AI for math focuses on fixed objectives (proving conjectures, optimizing known quantities), failing at open-ended discovery where the hypothesis itself must emerge from exploration.

Why it matters:

Pure symbolic methods hit combinatorial walls (e.g., exhaustive search fails at n=18 for this problem)
Pure neural methods (LLMs) lack the rigor required for proofs and often hallucinate constructive claims
Understanding the distinct roles of neural intuition, symbolic verification, and human steering is critical for replicating discovery workflows

Concrete Example: When searching for 'perfect permutations' to minimize imbalance, the symbolic solver hit a timeout at n=18. The agent, analyzing numerical data from a relaxed search, noticed a parity constraint (all distances were even) that the human had missed, leading to a new theorem.

Key Novelty

Agentic Neurosymbolic Collaboration Framework

Integrates an LLM agent (Claude Opus 4.5) with external symbolic tools (SageMath, Rust solvers) and a structured 'progressive disclosure' persistent memory to maintain context across sessions
Utilizes multi-model deliberation among frontier LLMs specifically for criticism and error detection, leveraging the finding that models are better at critiquing than constructing
Formalizes a workflow where the agent handles hypothesis generation/coding, symbolic tools handle verification, and the human provides strategic pivots (reframing the research question)

Evaluation Highlights

Discovered and proved a tight lower bound of 4n(n-1)/9 for Latin square imbalance (n ≡ 1 mod 3), settling an open theoretical question
Constructed 'near-perfect permutations' achieving this bound for all n up to 52, exceeding the prior computational limit of n=18
Refuted a multi-model consensus claim that modular inversion achieves O(n^2.5) imbalance, demonstrating it scales as Θ(n^3.6) instead
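
As a quick arithmetic check, the stated bound 4n(n-1)/9 can be evaluated exactly for small n ≡ 1 (mod 3). The sketch below only evaluates the formula quoted above; the function name `imbalance_lower_bound` is an illustrative choice, not something taken from the paper.

```python
from fractions import Fraction


def imbalance_lower_bound(n: int) -> Fraction:
    """Evaluate the stated bound 4n(n-1)/9 exactly (hypothetical helper name)."""
    if n % 3 != 1:
        raise ValueError("the stated bound applies to n ≡ 1 (mod 3)")
    return Fraction(4 * n * (n - 1), 9)


# Print the exact and decimal value for the first few admissible n.
for n in (4, 7, 10, 13):
    bound = imbalance_lower_bound(n)
    print(n, bound, round(float(bound), 2))
```

For n=13 this gives 208/3, or about 69.33, which matches the value reported in the results table.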

Breakthrough Assessment

9/10

Demonstrates genuine mathematical discovery (new theorem + construction) settling an open problem. The detailed process analysis of human-AI interaction offers a replicable blueprint for neurosymbolic research.

⚙️ Technical Details

Problem Definition

Setting: Open-ended mathematical exploration and optimization in combinatorial design theory

Inputs: Natural language research goals, strategic direction from human researcher

Outputs: Mathematical conjectures, executable code, formal proofs, and verified combinatorial objects

Pipeline Flow

Human Researcher (Strategy) → AI Agent (Hypothesis/Code)
AI Agent ↔ Symbolic Tools (Verification/Search)
AI Agent ↔ Persistent Memory (Context)
AI Agent ↔ Multi-Model Review (Criticism)

System Modules

Human Researcher

Sets strategic goals, initiates pivots (e.g., changing from search to optimization), and reviews outputs

Model or implementation: Human Expert

AI Agent

Translates goals into code, runs tools, analyzes data for patterns, and drafts proofs

Model or implementation: Claude Opus 4.5

Symbolic Tools

Execute rigorous computation: algebraic analysis (SageMath), exhaustive enumeration (Rust), and heuristic search (Simulated Annealing)

Model or implementation: SageMath, Custom Rust Solver, Python JIT
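
The heuristic-search component can be illustrated with a generic simulated-annealing loop over permutations. This is a minimal sketch under stated assumptions, not the paper's solver: the summary does not give the actual objective, so `toy_imbalance` (total absolute deviation of pairwise symbol distances from n(n+1)/3 in the Latin square formed by cyclic shifts of a permutation), `shift_square`, and `anneal` are all hypothetical stand-ins, as are the cooling schedule and step count.

```python
import math
import random
from fractions import Fraction
from itertools import combinations


def shift_square(perm):
    """Assumed construction: row i is the permutation cyclically shifted by i."""
    n = len(perm)
    return [[perm[(i + j) % n] for j in range(n)] for i in range(n)]


def toy_imbalance(perm):
    """Assumed cost: sum over symbol pairs of |total row distance - n(n+1)/3|."""
    n = len(perm)
    ideal = Fraction(n * (n + 1), 3)
    totals = dict.fromkeys(combinations(range(n), 2), 0)
    for row in shift_square(perm):
        col = {sym: j for j, sym in enumerate(row)}
        for a, b in totals:
            totals[(a, b)] += abs(col[a] - col[b])
    return float(sum(abs(d - ideal) for d in totals.values()))


def anneal(n, cost, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing with swap moves and geometric cooling."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    current = cost(perm)
    best, best_cost = perm[:], current
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(n), 2)
        perm[i], perm[j] = perm[j], perm[i]  # propose a swap
        candidate = cost(perm)
        accept = candidate <= current or rng.random() < math.exp((current - candidate) / temp)
        if accept:
            current = candidate
            if current < best_cost:
                best, best_cost = perm[:], current
        else:
            perm[i], perm[j] = perm[j], perm[i]  # undo the rejected swap
    return best, best_cost


best_perm, best_value = anneal(7, toy_imbalance, steps=500)
print(best_perm, best_value)
```

The cost function is passed in as a parameter so the same loop could be reused with whatever objective the real search uses.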

Persistent Memory

Maintains project state across sessions without weight updates via a two-tier file system

Model or implementation: File System (JSON/Text)
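
One way to picture a two-tier file-based memory is a small index of one-line summaries (tier 1) that is always read, plus full note files (tier 2) that are loaded only on request. The sketch below is an assumption about how such a store could look; the class `ProgressiveMemory`, its method names, and the file layout are hypothetical and are not described in the paper summary.

```python
import json
import tempfile
from pathlib import Path


class ProgressiveMemory:
    """Hypothetical two-tier store: a compact index plus full notes on disk."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "index.json"
        if not self.index_path.exists():
            self.index_path.write_text("{}")

    def save(self, name, summary, content):
        """Write the full note and record its one-line summary in the index."""
        (self.root / f"{name}.txt").write_text(content)
        index = json.loads(self.index_path.read_text())
        index[name] = summary
        self.index_path.write_text(json.dumps(index, indent=2))

    def overview(self):
        """Tier 1: the only thing loaded at the start of a session."""
        return json.loads(self.index_path.read_text())

    def load(self, name):
        """Tier 2: full content, retrieved only when the agent asks for it."""
        return (self.root / f"{name}.txt").read_text()


with tempfile.TemporaryDirectory() as tmp:
    memory = ProgressiveMemory(tmp)
    memory.save("algebraic_approach", "Dead end; do not retry", "Full notes on why it failed.")
    print(memory.overview())                  # short summaries only
    print(memory.load("algebraic_approach"))  # full text on demand
```

Keeping the index small is what lets a new session start with the whole project outline in context while deferring the cost of the detailed notes.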

Novel Architectural Elements

Two-tier persistent memory system enabling multi-session continuity and 'incremental symbolic learning' without model training
Asymmetric use of multi-model deliberation: trusted for criticism/error detection but distrusted for constructive claims

Modeling

Base Model: Anthropic Claude Opus 4.5

Comparison to Prior Work

vs. FunSearch/AlphaGeometry: Addresses open-ended exploration where the hypothesis is not known a priori, rather than optimizing a fixed objective or proving a stated theorem
vs. Pure Symbolic Search: Gets past the combinatorial explosion (exhaustive search stalls at n=18) by pivoting to heuristic search and pattern recognition
vs. Standard LLM Assistants: Introduces persistent memory and tool integration to enable multi-session research continuity

Limitations

Multi-model deliberation is unreliable for constructive mathematical claims (e.g., the models agreed on a wrong bound for modular inversion)
Success relied critically on human strategic pivots (e.g., reframing the question from 'find zero imbalance' to 'minimize positive imbalance')
The algebraic approach proved to be a dead end, consuming significant time before the pivot

📊 Experiments & Results

Evaluation Setup

Mathematical proof verification and computational search for Latin square imbalance

Benchmarks:

Imbalance of Latin Squares (n ≡ 1 mod 3) (Combinatorial Optimization / Theoretical Bound)

Metrics:

Imbalance value I(L)
Maximum n for which solution is found
Statistical methodology: Not applicable (Mathematical proof)
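
To make the imbalance metric more concrete, the sketch below computes the quantity that spatial balance is usually built on: the total distance between each pair of symbols, summed over all rows. It assumes the standard SBLS notion of distance (difference in column positions); how the paper aggregates deviations into I(L) is not stated in this summary, so only the pairwise totals and their mean are shown. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def is_latin_square(square):
    """Check that every row and column contains each symbol 0..n-1 once."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok


def pair_distances(square):
    """Total column distance between each symbol pair, summed over rows."""
    n = len(square)
    totals = dict.fromkeys(combinations(range(n), 2), 0)
    for row in square:
        col = {sym: j for j, sym in enumerate(row)}
        for a, b in totals:
            totals[(a, b)] += abs(col[a] - col[b])
    return totals


def ideal_distance(n):
    """Mean pair distance n(n+1)/3; every pair must equal it for perfect balance."""
    return Fraction(n * (n + 1), 3)


n = 4
cyclic = [[(i + j) % n for j in range(n)] for i in range(n)]
totals = pair_distances(cyclic)
mean = Fraction(sum(totals.values()), len(totals))
print(is_latin_square(cyclic), mean, ideal_distance(n))
```

Each pair total is an integer, while n(n+1)/3 is not an integer when n ≡ 1 (mod 3), which is why zero imbalance is unreachable in that case and a positive lower bound is the natural target.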

Key Results

The agent established a new theoretical lower bound and computationally verified it, significantly outperforming naive estimates and prior search limits.

Benchmark	Metric	Baseline	This Paper	Δ
Latin Square Imbalance (n=13)	Lower Bound Value	26	69.33	+43.33
Latin Square Enumeration	Max n Solved	18	52	+34
Modular Inversion Map Imbalance	Scaling Law	O(n^2.5)	Θ(n^3.6)	Disproven
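
The summary does not say how the Θ(n^3.6) scaling was established. One common empirical check for a claimed power law is to fit the slope of log(value) against log(n). The sketch below shows that technique on synthetic data generated from an exact cubic, purely as a sanity check; `fit_exponent` is a hypothetical helper and none of the numbers come from the paper.

```python
import math


def fit_exponent(ns, values):
    """Least-squares slope of log(value) versus log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


ns = [16, 32, 64, 128, 256]
synthetic = [n ** 3 for n in ns]  # synthetic values, NOT measurements from the paper
print(round(fit_exponent(ns, synthetic), 6))
```

A fitted slope is only empirical evidence about growth over the sampled range; it can contradict a claimed exponent but does not by itself prove an asymptotic bound.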

Main Takeaways

The 'Research Pivot' (changing the question from search to optimization) was the single most critical step, initiated by the human.
The agent's primary contribution was 'uncovering hidden structure' (the parity constraint) from data, which human experts missed.
Multi-model deliberation is asymmetric: highly reliable for catching errors in proofs (criticism) but unreliable for generating new mathematical claims.
Persistent memory enabled the agent to avoid repeating the documented algebraic 'dead end' in later sessions.

📚 Prerequisite Knowledge

Prerequisites

Combinatorial design theory (Latin squares)
Basic group theory (permutations, orbits)
Neurosymbolic AI concepts

Key Terms

Latin square: An n × n array filled with n different symbols such that each symbol occurs exactly once in each row and column

Imbalance: A metric measuring the deviation of a Latin square from perfect spatial balance; zero imbalance is impossible for n ≡ 1 (mod 3)

Near-perfect permutation: A novel class of permutations introduced in this paper where shift correlations deviate minimally from the ideal value, satisfying parity constraints

SBLS: Spatially Balanced Latin Square, a Latin square in which every pair of symbols has the same total distance (difference in column positions) summed over all rows

Simulated Annealing: A probabilistic optimization technique used here by the agent to find permutations when exhaustive search failed

SageMath: A computer algebra system used by the agent for algebraic analysis and polynomial interpolation

Claude Opus 4.5: The specific Large Language Model used as the core of the AI agent

Neurosymbolic AI: The integration of neural networks (like LLMs) with symbolic reasoning tools (logic solvers, algebra systems)

Progressive disclosure: A memory design where the agent sees only a high-level index of files initially and retrieves full content on demand to manage context window limits