STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

📝 Paper Summary

Benchmark datasets Agentic data generation

STORM-BORN is a highly challenging dataset of mathematical derivations extracted from research papers using a multi-agent framework and filtered by experts, on which even GPT-o1-Pro fails significantly.

Core Problem

Existing mathematical datasets are either too simple (numerical reasoning) or rely on formal languages that lack human-like intuition, while synthetic data often suffers from unreliable annotations.

Why it matters:

Current LLMs have saturated traditional math benchmarks like GSM8K (~95% accuracy), necessitating harder challenges to probe intelligence upper bounds
Formal proof datasets (e.g., MiniF2F) obscure interpretable, intuitive reasoning processes required for human-like mathematical understanding
Reliable expert annotation for complex math is costly, and single-LLM generation lacks the necessary quality and reliability for deep derivation tasks

Concrete Example: Numerical datasets ask for the expected value of a coin toss (calculating 0.25). Formal datasets ask to prove '7(3y+2)=21y+14' in Lean code. STORM-BORN asks to derive a specific partition function Z(x) for a KL-constrained reward maximization objective (Equation 4) based on a prior KL divergence formula (Equation 3) from a specific research paper, requiring multiple logical leaps and definitions.

Key Novelty

Human-in-the-Loop Multi-Agent Framework (STORM)

Decomposes complex data generation into specialized agents (Extraction, Query Drafting, Retrieval, Context Collection) to handle long-context reasoning better than single models
Integrates 'Reasoning-dense Content Filtering' to select source material rich in derivations (e.g., proofs in appendices) rather than simple descriptions
Employs a rigorous human-expert selection process to curate only the most challenging, creative, and reasoning-dense problems from the synthetic pool

Architecture

Overview of the data generation framework consisting of Filtering, Multi-agent Generation, and Human Selection.

Evaluation Highlights

Less than 5% accuracy for state-of-the-art models (GPT-o1-Pro, DeepSeek-R1) on STORM-BORN, compared to ~95% on GSM8K
+9.12% accuracy improvement on MATH benchmark for Qwen2.5-7B after fine-tuning on just 100 STORM-BORN samples
TinyLlama-1.1B achieves a 233% relative improvement on the MATH dataset after fine-tuning on STORM-BORN

Breakthrough Assessment

8/10

Provides a necessary 'next-level' difficulty benchmark where current SOTA fails completely, while demonstrating that small, high-quality data (100 samples) significantly boosts reasoning generalization.

⚙️ Technical Details

Problem Definition

Setting: Generating and validating complex mathematical derivation problems (Question-Answer pairs) from unstructured academic paper texts.

Inputs: Raw PDFs of academic papers (arXiv)

Outputs: Self-contained mathematical questions with step-by-step derivation answers.

Pipeline Flow

Content Filtering: Select reasoning-dense papers
Extraction: Extract LaTeX formulas
Drafting: Agent generates draft queries
Retrieval: Agent retrieves answers from text
Contextualization: Agent collects evidence to make questions self-contained
Refinement: Agent refines Q&A pairs
Filtration: Agent removes irrelevant content
Human Selection: Experts pick top problems

System Modules

Math Expression Extractor Agent

Recognize and extract mathematical expressions from paper PDFs into LaTeX format

Model or implementation: Lightweight multi-modal LLM

Query Draft Agent

Generate initial questions focusing on theorem or formula derivation based on extracted expressions

Model or implementation: GPT-o1-Pro

Answer Retriever Agent

Search the paper for relevant content and extract the answer directly to avoid hallucination

Model or implementation: GPT-o1-Pro

Context Collector Agent (Refinement)

Capture background information/definitions needed to make the Q&A self-contained

Model or implementation: LLM Agent (Specific model not detailed, likely GPT-o1 based context)

Question Refiner Agent (Refinement)

Integrate context evidence into the query and answer to create a standalone problem

Model or implementation: LLM Agent

Novel Architectural Elements

Six-agent sequential pipeline specifically designed to decompose the 'reading-understanding-questioning' process of complex math papers
Hybrid workflow combining automated multi-agent generation with a final 'Human Expert Selection' phase for extreme difficulty filtering

Modeling

Base Model: Llama-3-8B and Qwen2.5-7B (for fine-tuning experiments)

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

STORM-BORN dataset (100 samples)
Comparisons with larger subsets (top-500, 2k)

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: GPT-o1-Pro used for data generation (200 USD cost reported)

Comparison to Prior Work

vs. GSM8K/MATH: STORM-BORN focuses on derivation/proofs from research papers, not competition/school math
vs. MiniF2F: Uses natural language and emphasizes human-like heuristic reasoning over formal code verification
vs. FrontierMath: Similar focus on extreme difficulty, but curates from research papers via multi-agent system rather than manual expert creation from scratch [not cited in paper]
+ 1 more
vs. LIMO: Both emphasize high-quality small data, but STORM-BORN specifically targets paper-based derivations

Limitations

Evaluation of complex derivations is difficult and currently requires human experts or very advanced models, as simple string matching fails.
The dataset is small (100 samples), which may limit its use for large-scale pre-training, though it is effective for SFT.
Reliance on proprietary models (GPT-o1-Pro) for the data generation pipeline raises costs and accessibility issues for reproduction.

Reproducibility

Code: https://github.com/lwhere/STORM-BORN

publicly available (https://github.com/lwhere/STORM-BORN). Dataset and code are released. Prompt details for agents are in Appendix. Training hyperparameters for the fine-tuning experiments are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation of advanced LLMs on the dataset, and fine-tuning of smaller models to test generalization.

Benchmarks:

STORM-BORN (Mathematical derivation) [New]
GSM8K (Grade school math)
MATH (Competition math)
AIME 2024/2025 (Advanced math competition)

Metrics:

Accuracy (Solved problems %)
Correctness score (0-1)
Completeness score
Similarity score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Evaluation of state-of-the-art proprietary models on the STORM-BORN dataset shows extreme difficulty.
STORM-BORN	Accuracy	94.8	4.6	-90.2
STORM-BORN	Accuracy	3.3	3.3	0.0
Fine-tuning experiments demonstrate that training on STORM-BORN improves performance on general math benchmarks.
MATH	Accuracy (4-shot)	54.42	63.54	+9.12
MATH	Accuracy (4-shot)	17.08	13.82	-3.26
GSM8K	Accuracy (0-shot)	15.39	42.91	+27.52
AIME 2024	Accuracy	20.00	23.33	+3.33
Ablation study on dataset size/quality shows that smaller, higher-quality data is better.
GSM8K	Accuracy (8-shot)	14.87	16.98	+2.11

Experiment Figures

Bar chart comparing human expert evaluation of model performance on STORM-BORN vs standard benchmarks.

Main Takeaways

Models trained on high-quality, reasoning-dense data (top-100) outperform those trained on larger, lower-quality sets (2k), confirming the 'quality over quantity' hypothesis for reasoning.
Fine-tuning on derivation tasks (STORM-BORN) generalizes significantly to numerical reasoning tasks (MATH, GSM8K), even though the dataset contains no explicit numerical problems.
Current LLMs (even o1-Pro and R1) fundamentally struggle with the deep, multi-step theoretical derivations found in academic papers.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) fine-tuning
Familiarity with mathematical reasoning benchmarks (GSM8K, MATH)
Basic knowledge of multi-agent systems

Key Terms

STORM: Synergistic Theorem and f ORmula Mining—the multi-agent framework proposed in this paper for extracting and generating math problems

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

ATP: Automated Theorem Proving—using formal logic and computer programs (like Lean or Isabelle) to verify mathematical proofs

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to improve performance on a target task

Reasoning Density: A measure of how many logical steps, heuristic cues, and trial-and-error processes are contained within a derivation

MLLM: Multi-modal Large Language Model—an AI model capable of processing both text and images (used here for LaTeX extraction)

DPO: Direct Preference Optimization—a method for aligning language models to human preferences (mentioned in the context of a source paper example)

GPT-o1-Pro: A specific high-capability version of the OpenAI o1 model series used as the backbone for the agents