Evaluation Setup
Evaluation on puzzle and math benchmarks using pass@k metric
Benchmarks:
- Enigmata-Eval (logical puzzles, 36 types) [New]
- ARC-AGI (Abstract Reasoning / Pattern Recognition)
- AIME (2024-2025) (Advanced Mathematics)
- GPQA (Diamond) (Graduate-Level Science QA)
Metrics:
- Pass@k
- Statistical methodology: Not explicitly reported in the paper
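Since the paper's headline metric is pass@k but its exact computation is not spelled out here, the sketch below assumes the standard unbiased pass@k estimator (Chen et al., 2021): draw k samples from n total generations of which c are correct, and estimate the probability that at least one drawn sample is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (assumed, not confirmed by this paper):
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    all k drawn samples come from the n - c incorrect generations."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must
        # include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 correct, k = 1
print(round(pass_at_k(10, 3, 1), 4))  # → 0.3
```

The guard clause avoids a negative binomial coefficient when correct samples outnumber the complement of k; the estimator reduces to the raw accuracy c/n at k = 1.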
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ARC-AGI | Accuracy | Not reported in the paper | 32.8 | Not reported in the paper |
| ARC-AGI 2 | Accuracy | Not reported in the paper | 0.6 | Not reported in the paper |
Main Takeaways
- Qwen2.5-32B-Enigmata consistently surpasses SoTA reasoning models (o1, o3-mini-high) on Enigmata-Eval and generalizes well to ARC-AGI.
- Training on diverse puzzles (Enigmata) yields a 'free lunch' improvement on specialized math (AIME) and STEM (GPQA) tasks, suggesting that deep logical reasoning is a transferable skill.
- Multi-stage RL (curriculum training) helps prevent forgetting and is crucial for learning difficult tasks like ARC-AGI alongside easier puzzles.