
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang
ByteDance Seed, Fudan University, Institute for AI Industry Research (AIR), Tsinghua University, Nanjing University, Shanghai Jiao Tong University
arXiv (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Logical Reasoning · Synthetic Data Generation
Enigmata improves LLM reasoning by generating unlimited, verifiable logic puzzles for reinforcement learning, demonstrating that puzzle-solving skills transfer to complex math and STEM domains.
Core Problem
Large Reasoning Models (LRMs) like o1 excel at domain-specific tasks (math/code) trained with verifiable rewards, but struggle with pure logic puzzles, which have lacked both large-scale training data and automated verification.
Why it matters:
  • Current puzzle datasets lack the scale and programmatic control needed for modern RL pipelines, relying on limited static examples
  • Puzzles test pure reasoning capabilities orthogonal to memorized knowledge, making them ideal for improving general intelligence
  • Existing methods rely on tool-use (code interpreters) rather than improving the model's internal reasoning chain
Concrete Example: In a 'Twiddle' puzzle (arranging numbers by rotating sub-grids), a standard model may hallucinate the grid state after a move because it cannot reliably simulate the rotation internally. Enigmata forces the model to learn the underlying logic, since the final arrangement is checked against a programmatic rule.
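To make the verification idea concrete, here is a minimal sketch (not the paper's actual code) of a rule-based checker for a Twiddle-style puzzle. It assumes a move rotates a 2×2 sub-grid clockwise; the `rotate_cw` and `verify` names are hypothetical. The key point is that the checker replays the model's moves programmatically instead of trusting the model's own account of the grid state.

```python
def rotate_cw(grid, r, c):
    """Rotate the 2x2 sub-grid whose top-left corner is (r, c) clockwise."""
    g = [row[:] for row in grid]  # copy so the input grid is untouched
    g[r][c] = grid[r + 1][c]          # top-left    <- bottom-left
    g[r][c + 1] = grid[r][c]          # top-right   <- top-left
    g[r + 1][c + 1] = grid[r][c + 1]  # bottom-right <- top-right
    g[r + 1][c] = grid[r + 1][c + 1]  # bottom-left <- bottom-right
    return g

def verify(start, moves, target):
    """Replay the proposed moves from the start state; binary RL reward."""
    grid = start
    for r, c in moves:
        grid = rotate_cw(grid, r, c)
    return 1 if grid == target else 0
```

Because the reward depends only on the replayed final state, a model that hallucinates intermediate grid states simply scores 0, which is what pressures it to learn accurate internal simulation.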
Key Novelty
Enigmata Suite (Data + Model Recipe)
  • Pairs every puzzle task with a Python generator (for infinite, controllable difficulty data) and a rule-based verifier (for instant RL rewards)
  • Proposes a 'Mix-training' and 'Multi-stage' RL recipe that combines rejection fine-tuning with curriculum-based reinforcement learning to prevent catastrophic forgetting
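The generator/verifier pairing can be sketched in a few lines. This is an illustrative toy task (sorting), not one of Enigmata's actual puzzles, and the `generate`/`verify` signatures are assumptions; it shows the contract the suite relies on: the generator exposes a difficulty knob for curriculum control, and the verifier returns an instant binary reward suitable for RLVR.

```python
import random

def generate(difficulty, seed=None):
    """Generate a puzzle instance; here difficulty = sequence length."""
    rng = random.Random(seed)  # seeded for reproducible instances
    items = rng.sample(range(100), k=difficulty)
    return {"prompt": f"Sort ascending: {items}", "answer": sorted(items)}

def verify(puzzle, model_output):
    """Rule-based check -> instant binary reward for RL training."""
    return 1 if model_output == puzzle["answer"] else 0

# Unlimited, controllable data: just vary difficulty and seed.
puzzle = generate(difficulty=5, seed=42)
print(puzzle["prompt"])
```

Since both functions are pure code, the pipeline can mint arbitrarily many instances per difficulty level and score rollouts without human labels, which is what makes the curriculum-based RL stage practical.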
Evaluation Highlights
  • Qwen2.5-32B-Enigmata surpasses o3-mini-high and o1 on the Enigmata-Eval benchmark (specific scores not reported in text)
  • Achieves 32.8% on ARC-AGI and 0.6% on ARC-AGI 2, demonstrating strong transfer to abstract reasoning tasks
  • Scaling to Seed1.5-Thinking (200B MoE) boosts performance on AIME (2024-2025), BeyondAIME, and GPQA Diamond using only knowledge-orthogonal puzzle data
Breakthrough Assessment
8/10
Provides a scalable, verifiable framework for reasoning training that generalizes to math/STEM. The 'free lunch' transfer from synthetic puzzles to advanced math is a significant finding.