saliency analysis: Techniques (like computing input gradients) to determine which features of the input most strongly affect the neural network's output, used here to spot mathematical patterns
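A minimal sketch of gradient-based saliency, using finite differences in place of automatic differentiation; the toy function `f` and the `saliency` helper are hypothetical, for illustration only:

```python
import numpy as np

def saliency(f, x, eps=1e-5):
    """Rank input features by the magnitude of the gradient of f at x,
    approximated with central finite differences."""
    grads = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        grads[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return np.abs(grads)  # larger magnitude = more salient feature

# Hypothetical model output: depends strongly on feature 0, weakly on feature 2,
# and not at all on feature 1.
f = lambda x: 10 * x[0] + 0.1 * x[2]
scores = saliency(f, np.array([0.3, 0.7, 0.5]))  # feature 0 is most salient
```

In practice the gradient would come from the network's autodiff framework rather than finite differences; the ranking idea is the same.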
Kazhdan-Lusztig polynomial: A polynomial with integer coefficients associated with a pair of elements (e.g., permutations) in a Coxeter group, central to representation theory
Bruhat graph: A directed graph structure representing relations between elements of a Coxeter group (like the symmetric group)
cross entropy method: An optimization technique where samples are drawn from a distribution, the best performers are selected, and the distribution is updated to increase the likelihood of those performers
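The loop described above can be sketched as follows; the Gaussian sampling distribution, the toy objective, and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cross_entropy_method(f, dim, iters=50, pop=100, elite_frac=0.2, seed=0):
    """Maximize f by repeatedly sampling, keeping the best performers,
    and refitting the sampling distribution to them."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))   # draw candidates
        scores = np.array([f(x) for x in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]    # best performers
        mean = elites.mean(axis=0)                         # update distribution
        std = elites.std(axis=0) + 1e-6                    # avoid collapse to 0
    return mean

# Toy objective: the optimum is at `target`.
target = np.array([1.0, -2.0, 0.5])
best = cross_entropy_method(lambda x: -np.sum((x - target) ** 2), dim=3)
```

After enough iterations the mean of the sampling distribution concentrates near the optimum.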
descent set: In combinatorics, the set of indices where a permutation value decreases (i.e., the indices i such that x(i) > x(i+1))
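The definition translates directly into code; this small helper (a sketch, with 1-based positions to match the x(i) > x(i+1) convention above) is illustrative:

```python
def descent_set(perm):
    """Return the 1-based positions i where perm decreases: perm[i] > perm[i+1]."""
    return {i + 1 for i in range(len(perm) - 1) if perm[i] > perm[i + 1]}

descent_set([3, 1, 4, 2])  # → {1, 3}: descents 3 > 1 and 4 > 2
```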
parity bit: A function that sums bits modulo 2; used here as an example of a noise-sensitive function that neural networks struggle to learn without sufficient density
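A quick illustration of why parity is maximally noise-sensitive (the code is a hypothetical sketch): flipping any single input bit always flips the output, so small input perturbations are never ignored.

```python
def parity(bits):
    """Sum of bits modulo 2: 1 if an odd number of bits are set, else 0."""
    return sum(bits) % 2

bits = [1, 0, 1, 1, 0]          # parity is 1 (three bits set)
# Flip each bit in turn: the output flips every time.
flipped = [parity(bits[:i] + [1 - bits[i]] + bits[i + 1:])
           for i in range(len(bits))]  # → [0, 0, 0, 0, 0]
```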
transformer: A deep learning architecture based on self-attention mechanisms, typically used for sequence-to-sequence tasks
graph neural network (GNN): A neural network architecture designed to operate on graph structures, capturing relationships between nodes and edges
Wigner matrices: Symmetric (or Hermitian) random matrices whose entries above the diagonal are independent and identically distributed (often Gaussian)
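A sketch of sampling a real symmetric Wigner matrix; the Gaussian entry distribution and the 1/sqrt(n) normalization (which keeps the eigenvalues in a bounded range as n grows) are illustrative choices, not prescribed by the definition:

```python
import numpy as np

def wigner(n, rng):
    """Sample an n x n real symmetric matrix with i.i.d. Gaussian entries,
    symmetrized and scaled so eigenvalues stay O(1)."""
    a = rng.normal(size=(n, n))
    return (a + a.T) / np.sqrt(2 * n)  # symmetrize, then normalize

m = wigner(200, np.random.default_rng(0))
# m is symmetric, so its eigenvalues are real; their histogram approaches
# the Wigner semicircle law as n grows.
```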
support vector machine (SVM): A supervised learning model that classifies data by finding an optimal hyperplane that separates classes with the maximum margin