DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

📝 Paper Summary

Synthetic Data Generation, Reasoning

DESIGNER synthesizes complex multidisciplinary reasoning questions by reverse-engineering 'Design Logic' from expert exams and applying these structured templates to new raw texts.

Core Problem

Existing synthetic data methods struggle to generate complex, multi-step reasoning questions across diverse disciplines, often defaulting to simple factual recall or being limited by seed question pools.

Why it matters:

LLMs lag behind human experts in university-level, discipline-specific reasoning due to scarce high-quality training data beyond math and code
Document-centric synthesis methods (using raw text) ensure coverage but lack control over difficulty and reasoning depth
Query-centric methods (rewriting seeds) are limited by the bias and coverage of the initial seed pool

Concrete Example: When generating questions from a history textbook, standard methods might ask 'When did the war start?' (factual recall). A human expert, however, would design a question requiring analysis of causes and effects. Without explicit guidance, LLMs fail to spontaneously generate these complex structures from raw text.

Key Novelty

Design-Logic-Guided Data Synthesis

Reverse-engineers 'Design Logic' from difficult human exam questions—abstract meta-knowledge describing the step-by-step process of constructing a complex question (e.g., set objective → build context → add distractors)
Decouples reasoning structure from content: applies these abstract Design Logics to entirely new source documents (books/web) to synthesize questions that retain the structural complexity of exams but cover new knowledge
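
One way to picture the decoupling described above is to treat a Design Logic as a content-free list of construction steps that is paired with a source passage only at synthesis time. The sketch below is illustrative: the `DesignLogic` class, its fields, `build_synthesis_prompt`, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignLogic:
    """Hypothetical container for one abstract question-design recipe."""
    name: str
    steps: tuple[str, ...]  # ordered, content-free construction steps


def build_synthesis_prompt(logic: DesignLogic, source_text: str) -> str:
    """Pair a content-free logic with a concrete passage (illustrative wording)."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(logic.steps, start=1))
    return (
        "Follow these question-design steps:\n"
        f"{numbered}\n\n"
        "Ground the question only in this passage:\n"
        f"{source_text}"
    )


# The same logic object can be reused with any number of different passages.
logic = DesignLogic(
    name="cause-effect analysis",
    steps=("set objective", "build context", "add distractors"),
)
prompt = build_synthesis_prompt(logic, "<passage from a history textbook>")
```

Because the logic carries no domain content, swapping the passage changes the knowledge being tested while the construction steps stay fixed.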

Architecture

The overall DESIGNER pipeline: Data Curation → Design Logic Extraction → Question Synthesis.

Evaluation Highlights

Synthesized 4.7 million questions (DLR-Book and DLR-Web) across 75 disciplines, with substantially higher difficulty than baselines (e.g., significantly more 'Very Hard' questions)
Qwen3-7B-Instruct fine-tuned on DESIGNER data outperforms its official post-trained version on GPQA-Diamond (32.8 vs 29.8) and MMLU-Pro (48.4 vs 47.9)
Llama-3.1-8B-Instruct fine-tuned on DESIGNER data achieves +7.2 accuracy gain on MMLU-Pro compared to the base model, surpassing the official Instruct version

Breakthrough Assessment

8/10

Proposes a novel 'Design Logic' abstraction that effectively bridges the gap between scalable but shallow document-based generation and high-quality but scarce human exam data. Strong empirical results on difficult reasoning benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Synthetic data generation for multidisciplinary reasoning

Inputs: Raw text corpora (books, web) and a seed bank of human exam questions

Outputs: Synthetic question-answer pairs with high reasoning complexity

Pipeline Flow

Data Curation: Filter & cluster raw questions and documents
Design Logic Extraction: LLM abstracts logic from seed questions
Logic Deduplication: Graph-based filtering of similar logics
Question Synthesis: Retrieve logic → Generate question from doc
Response Synthesis: Generate CoT answers
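
The five stages listed above can be sketched as a single driver that wires the stages together. This is a minimal sketch, not the authors' implementation: every stage is passed in as a plain callable, and the lambdas in the usage part are toy stand-ins for the LLM and embedding calls a real pipeline would make.

```python
from typing import Callable, Iterable


def run_pipeline(
    seed_questions: Iterable[str],
    documents: Iterable[str],
    curate: Callable[[list[str]], list[str]],
    extract_logic: Callable[[str], str],
    deduplicate: Callable[[list[str]], list[str]],
    retrieve_logic: Callable[[str, list[str]], str],
    synthesize_question: Callable[[str, str], str],
    synthesize_response: Callable[[str], str],
) -> list[dict[str, str]]:
    # Data curation: filter both the seed questions and the raw documents.
    seeds = curate(list(seed_questions))
    docs = curate(list(documents))
    # Design logic extraction followed by logic deduplication.
    logics = deduplicate([extract_logic(q) for q in seeds])
    samples = []
    for doc in docs:
        # Question synthesis: retrieve a logic, then apply it to the document.
        logic = retrieve_logic(doc, logics)
        question = synthesize_question(logic, doc)
        # Response synthesis: attach a chain-of-thought style answer.
        samples.append({"question": question, "response": synthesize_response(question)})
    return samples


# Toy stand-ins (hypothetical) so the control flow can be run end to end.
samples = run_pipeline(
    seed_questions=["Q1", "Q1", "Q2"],
    documents=["doc A", "doc B"],
    curate=lambda items: [x for x in items if x.strip()],
    extract_logic=lambda q: f"logic({q})",
    deduplicate=lambda logics: list(dict.fromkeys(logics)),
    retrieve_logic=lambda doc, logics: logics[0],
    synthesize_question=lambda logic, doc: f"{logic} applied to {doc}",
    synthesize_response=lambda question: f"step-by-step answer to: {question}",
)
```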

System Modules

Curator

Filter and select high-quality source texts and seed questions

Model or implementation: Qwen3-30B-A3B (non-thinking), ModernBERT-large

Logic Extractor (Logic Processing)

Reverse-engineer Design Logic from seed questions

Model or implementation: DeepSeek-R1-0528

Logic Deduplicator (Logic Processing)

Remove redundant logics to ensure diversity

Model or implementation: Qwen3-Embedding-4B (for similarity)
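
Graph-based deduplication of this kind can be sketched as follows: connect any two logics whose embedding similarity exceeds a threshold, then keep one representative per connected component. The summary does not state the exact graph algorithm or threshold, so the connected-components rule, the `threshold=0.85` value, and the function names below are assumptions, and the vectors are toy data rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_by_similarity_graph(embeddings, threshold=0.85):
    """Return the lowest index of each connected component of the similarity graph."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Always keep the smaller index as the component root.
                    parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(len(embeddings))})


# Toy embeddings: items 0 and 1 are near-duplicates, as are items 2 and 3.
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
kept = dedup_by_similarity_graph(toy_vectors)
```

The pairwise loop is quadratic in the number of logics, which is fine for a sketch but would need an approximate nearest-neighbor index at scale.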

Synthesizer

Generate new questions by applying Design Logic to source text

Model or implementation: DeepSeek-R1-0528

Novel Architectural Elements

Abstraction of 'Design Logic' as an intermediate representation for data synthesis, decoupling reasoning structure from domain content
Retrieve-and-Generate mechanism for matching abstract Design Logics to raw text segments based on embedding similarity
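
The retrieval half of this mechanism can be sketched as a top-k cosine-similarity lookup between a text segment's embedding and the embeddings of the stored Design Logics. This is a hedged illustration: `retrieve_logics`, the value of `k`, and the toy vectors are assumptions, and a real system would obtain the vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_logics(doc_vec, logic_vecs, k=2):
    """Return indices of the k logics most similar to the document embedding."""
    scored = [(cosine(doc_vec, vec), idx) for idx, vec in enumerate(logic_vecs)]
    # Sort by descending similarity; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy data: the document vector is closest to logic 1, then logic 0.
doc_vec = [1.0, 1.0, 0.0]
logic_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidates = retrieve_logics(doc_vec, logic_vecs, k=2)
```

The retrieved candidates would then be handed to the generation step, which applies one of them to the text segment.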

Modeling

Base Model: Qwen3-7B-Instruct / Llama-3.1-8B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

DLR-Book (3.04M questions)
DLR-Web (1.66M questions)
Mixed with 200k general instruction data (Magpie-Pro-300K, HelpSteer2)

Key Hyperparameters:

learning_rate: 5e-6
batch_size: 128
epochs: 1
max_length: 8192
scheduler: cosine
warmup_ratio: 0.03
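
To make the scheduler settings above concrete, the sketch below computes the learning rate at a given step for a cosine schedule with warmup. The summary only gives the peak rate (5e-6), the cosine scheduler, and the warmup ratio (0.03); the linear warmup shape, the decay to zero, and the function name `lr_at_step` are assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.03):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Example: inspect a few points of a hypothetical 1000-step run.
schedule = [lr_at_step(s, total_steps=1000) for s in (0, 29, 30, 500, 999)]
```

With a 0.03 warmup ratio, only the first 3% of steps are spent ramping up, so nearly the whole single epoch runs on the cosine decay.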

Compute: 8x H800 GPUs

Comparison to Prior Work

vs. Evol-Instruct: DESIGNER uses document-grounded generation guided by expert logic rather than just rewriting existing queries, ensuring better disciplinary coverage and correctness.
vs. Cosmopedia: DESIGNER focuses on exam-style reasoning questions rather than textbook content generation.
vs. Nemotron [not cited in paper]: DESIGNER explicitly models the *process* of question design (Design Logic) to control difficulty, whereas Nemotron relies on direct generation which may lack depth control.
vs. Bonito [not cited in paper]: Bonito converts unannotated text to instruction tuning data using conditional task generation; DESIGNER adds the explicit intermediate step of 'Design Logic' retrieval to enforce reasoning complexity.

Limitations

Relies on a high-quality seed question bank for extracting Design Logics.
Logic matching is based on embedding similarity, which might not always capture the best structural fit for a text segment.
Evaluation focuses on multiple-choice benchmarks, with less analysis of open-ended generation quality beyond CoT correctness.

Reproducibility

Code: https://attention-is-all-i-need.github.io/Design-Logic-Reasoning

The DLR-Book and DLR-Web datasets are released, and the logic extraction and synthesis prompts are provided in the paper's figures. The base models (Qwen3, Llama3) are public. The proprietary question bank used for seed logics is not released, but the method is claimed to work with any question bank.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on diverse reasoning benchmarks

Benchmarks:

MMLU-Pro (Multi-discipline reasoning; a harder MMLU)
GPQA-Diamond (Graduate-level science QA)
MATH (Mathematics problems)
SciBench (Scientific reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

SFT on DESIGNER data significantly improves performance over base models and other synthetic baselines.
Benchmark	Metric	Baseline	This Paper	Δ
MMLU-Pro	Accuracy	44.6	48.2	+3.6
GPQA-Diamond	Accuracy	29.8	32.8	+3.0
MMLU-Pro	Accuracy	42.1	48.2	+6.1
SciBench	Accuracy	6.87	10.05	+3.18
Ablation studies confirm the value of both Book and Web sources and the Design Logic method.
Benchmark	Metric	Baseline	This Paper	Δ
Average (MMLU-Pro, GPQA, etc.)	Score	44.6	46.2	+1.6

Experiment Figures

Bar chart comparing difficulty distribution of DLR datasets vs baselines.

Donut charts showing disciplinary distribution.

Main Takeaways

Synthesized data (DLR) enables base models to outperform their official Instruct versions on hard reasoning benchmarks.
DLR data is significantly more effective than existing open-source synthetic datasets (Magpie, OpenMathInstruct) for multidisciplinary reasoning.
Design Logic guidance produces harder and more diverse questions compared to direct generation or rewriting methods.
Combining Book and Web corpora (DLR-Mix) yields the best overall performance, suggesting complementary knowledge sources.

📚 Prerequisite Knowledge

Prerequisites

Language Model Fine-tuning (SFT)
Retrieval-Augmented Generation (RAG)
Synthetic Data Generation pipelines

Key Terms

Design Logic: A structured, reusable abstraction of the thought process human experts use to create exam questions (e.g., objectives, constraints, reasoning paths)

SFT: Supervised Fine-Tuning—training a model on labeled examples to follow instructions or learn specific behaviors

MinHash: A technique for quickly estimating the similarity between two sets, used here for deduplicating text data

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

GPQA: A challenging QA benchmark (Graduate-Level Google-Proof Q&A) designed to be difficult even for experts

MMLU-Pro: An enhanced version of the Massive Multitask Language Understanding benchmark with harder questions and more distractors
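
As a companion to the MinHash entry above, the following minimal sketch estimates the Jaccard similarity of two texts from MinHash signatures. It is a generic illustration of the technique, not the paper's deduplication code: the word-shingle size, the 64 hash functions, the SHA-1 based hashing, and all function names are assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping n-word shingles from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


text_a = "design logic guides the synthesis of complex exam questions"
text_b = "entirely unrelated words appear in this other short sentence"
sig_a = minhash_signature(shingles(text_a))
sig_b = minhash_signature(shingles(text_b))
same = estimated_jaccard(sig_a, sig_a)
different = estimated_jaccard(sig_a, sig_b)
```

Two documents whose estimated similarity exceeds a chosen cutoff would be treated as near-duplicates, and only one of them kept.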