Evaluation Setup
Evaluation on puzzle and math benchmarks using pass@k metric
Benchmarks:
- Enigmata-Eval (logical puzzles, 36 types) [New]
- ARC-AGI (Abstract Reasoning / Pattern Recognition)
- AIME (2024-2025) (Advanced Mathematics)
- GPQA (Diamond) (Graduate-Level Science QA)
Metrics:
- Pass@k
- Statistical methodology: Not explicitly reported in the paper
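Since the paper's headline metric is pass@k but its exact computation is not spelled out here, the sketch below assumes the standard unbiased pass@k estimator (Chen et al., 2021): draw k samples from n total generations of which c are correct, and estimate the probability that at least one drawn sample is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (assumed, not confirmed by this paper):
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    all k drawn samples come from the n - c incorrect generations."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must
        # include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 correct, k = 1
print(round(pass_at_k(10, 3, 1), 4))  # → 0.3
```

The guard clause avoids a negative binomial coefficient when correct samples outnumber the complement of k; the estimator reduces to the raw accuracy c/n at k = 1.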
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ARC-AGI | Accuracy | Not reported in the paper | 32.8 | Not reported in the paper |
| ARC-AGI 2 | Accuracy | Not reported in the paper | 0.6 | Not reported in the paper |
Main Takeaways
- Qwen2.5-32B-Enigmata consistently surpasses SoTA reasoning models (o1, o3-mini-high) on Enigmata-Eval and generalizes well to ARC-AGI.
- Training on diverse puzzles (Enigmata) yields a 'free lunch' improvement on specialized math (AIME) and STEM (GPQA) tasks, suggesting that deep logical reasoning is a transferable skill.
- Multi-stage RL (curriculum training) helps prevent forgetting and is crucial for learning difficult tasks like ARC-AGI alongside easier puzzles.